Lectures on Optimization¶
Welcome to Lectures on Optimization, a structured set of lecture notes designed to build the mathematical foundations required to understand, analyze, and implement modern optimization algorithms, with a strong focus on convex analysis and scalable algorithmic methods central to machine learning.
These notes are inspired by and draw heavily on material from:
- Convex Optimization — Stephen Boyd & Lieven Vandenberghe, Cambridge University Press, 2004 (link)
- Stanford EE364A: Convex Optimization I — Stephen Boyd (2023) (link)
- Optimization — The University of Texas at Austin, Spring 2020 (link)
The goal is not to reproduce the content of these courses, but to distill, reorganize, and extend their core ideas into a machine-learning-aligned, optimization-first framework, prioritizing:
- Convex problem structure, geometry, and theoretical principles
- Algorithmic design, convergence behavior, and computational efficiency
- Scalable and practical methods used in ML, deep learning, and NLP optimization
These lectures develop intuition, mathematical rigor, and practical algorithmic insight across major areas including:
- Convex sets, convex functions, and problem modeling
- Gradient, subgradient, Newton, and proximal methods
- Constrained optimization, duality, and KKT theory
- Decomposition, interior-point methods, and path-following algorithms
- Large-scale optimization strategies that power real-world machine learning systems
Case Studies
I - Transportation Optimization: A Linear Programming Case Study.¶
We consider a classical transportation optimization problem, also known as the minimum-cost flow problem, arising in supply chain and logistics planning. A set of production facilities (plants) must ship goods to a set of customer orders at minimum total transportation cost, subject to supply, demand, and routing constraints.
Formulation¶
We assume the following information is known:
- Supply capacity at each plant
- Demand requirement for each order
- Transportation cost between feasible plant-order pairs
- Feasible shipping lanes (not all plants can serve all orders)
Question: determine how much product to ship from each plant to each order so as to minimize total transportation cost while fully satisfying demand and respecting supply limits.
Objective Function¶
- Minimize total transportation cost: $\min \sum_{i \in P} \sum_{j \in O} c_{ij} x_{ij}$
Decision Variables¶
- $x_{ij} \ge 0$: quantity shipped from plant $i$ to order $j$
Parameters¶
- $c_{ij}$: unit transportation cost from plant $i$ to order $j$
- $s_i$: supply capacity of plant $i$
- $d_j$: demand requirement of order $j$
- $A \subseteq P \times O$: set of feasible shipping lanes
Constraints¶
- Supply constraints (plant capacity): $\sum_{j \in O} x_{ij} \le s_i \quad \forall i \in P$
- Demand constraints (order fulfillment): $\sum_{i \in P} x_{ij} = d_j \quad \forall j \in O$
- Route feasibility constraints: $x_{ij} = 0 \quad \forall (i,j) \notin A$
- Non-negativity: $x_{ij} \ge 0 \quad \forall i,j$
This problem belongs to the class of linear programming problems, characterized by continuous decision variables, convex feasible regions, and polynomial-time solvability.
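Before working with the full dataset below, the structure of this LP can be seen on a toy instance (two plants, three orders, made-up costs and quantities) solved with `scipy.optimize.linprog`; the constraint matrices mirror the supply/demand formulation above.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (illustrative numbers only): 2 plants, 3 orders
cost = np.array([[4.0, 6.0, 9.0],
                 [5.0, 3.0, 8.0]])      # c[i, j]: unit cost plant i -> order j
supply = np.array([20.0, 30.0])         # s_i
demand = np.array([15.0, 25.0, 10.0])   # d_j (total demand = total supply = 50)

nP, nO = cost.shape
c = cost.ravel()  # decision vector x is flattened row-major: x[i*nO + j] = x_ij

# Demand rows (equalities): sum_i x_ij = d_j
A_eq = np.zeros((nO, nP * nO))
for j in range(nO):
    A_eq[j, j::nO] = 1.0
# Supply rows (inequalities): sum_j x_ij <= s_i
A_ub = np.zeros((nP, nP * nO))
for i in range(nP):
    A_ub[i, i * nO:(i + 1) * nO] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=supply, A_eq=A_eq, b_eq=demand,
              bounds=(0, None), method="highs")
print(res.status, res.fun)  # status 0 = optimal
```

The same flattened-arc layout (a cost vector indexed by arcs plus sparse incidence-style constraint matrices) is exactly what the CVXPY model below builds at scale.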
#!pip uninstall -y pandas numpy matplotlib scipy
#!pip install --no-cache-dir numpy pandas matplotlib scipy
import re
from pathlib import Path
from typing import Dict, Tuple, Optional, List
import numpy as np
import pandas as pd
import cvxpy as cp
from scipy import sparse
import matplotlib.pyplot as plt
from matplotlib.colors import PowerNorm
rng = np.random.default_rng(42)
np.random.seed(42)
#!pip install cvxpy[glpk]
#!pip install ecos
Dataset Source¶
The data used in this notebook comes from the Supply Chain Logistics Problem Dataset, originally published by researchers at Brunel University London on Figshare.
Source details:
- Title: Supply Chain Logistics Problem Dataset
- Publisher: Brunel University London
- Link
Dataset Structure¶
The dataset provides a complete snapshot of a multi-node supply chain, organized into several tables, including:
- OrderList – customer orders that must be fulfilled
- Plants / Facilities – supply locations with capacity limits
- Shipping Lanes / Routes – feasible connections between plants and orders
- Cost & Attributes – transportation cost and lane-specific properties
Together, these tables allow us to construct a realistic transportation network with real-world constraints.
Data description¶
Plants (`plants`)
- `plant_id`: unique identifier of a plant
- `supply_cap`: maximum supply capacity available at the plant

Orders (`orders`)
- `order_id`: unique identifier of an order
- `demand`: required demand that must be satisfied

Lanes (`lanes`)
- Each row represents a feasible shipping lane from a plant to an order
- `plant_id`, `order_id`: endpoints of the lane
- `unit_cost`: per-unit shipping cost on that lane
After preprocessing:
- all plants and orders included in the model have at least one feasible lane
- total supply is sufficient to meet total demand
# Data Cleaning Utilities
def canon(s: str) -> str:
"""Canonical column name: lowercase, strip, replace non-alnum with underscore."""
s = s.strip().lower()
s = re.sub(r"[^a-z0-9]+", "_", s)
s = re.sub(r"_+", "_", s).strip("_")
return s
def clean_df(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df.columns = [canon(c) for c in df.columns]
# Normalize common missing value tokens
df = df.replace({"": np.nan, "NA": np.nan, "N/A": np.nan, "null": np.nan})
return df
def load_excel_sheets(xlsx_path: Path) -> Dict[str, pd.DataFrame]:
print(f"Loading workbook: {xlsx_path.resolve()}")
xls = pd.ExcelFile(xlsx_path)
sheets = {}
for sheet_name in xls.sheet_names:
df = pd.read_excel(xls, sheet_name=sheet_name)
df = clean_df(df)
sheets[canon(sheet_name)] = df
print(f"Loaded {len(sheets)} sheets: {list(sheets.keys())}")
return sheets
xlsx_path = Path("data/SupplyChainLogisticsProblems.xlsx")
sheets = load_excel_sheets(xlsx_path)
Loading workbook: C:\Users\salmank\Documents\convex_optimization\docs\convex\tutorials\data\SupplyChainLogisticsProblems.xlsx Loaded 7 sheets: ['orderlist', 'freightrates', 'whcosts', 'whcapacities', 'productsperplant', 'vmicustomers', 'plantports']
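To see what the `canon` helper does to messy spreadsheet headers, here is a self-contained copy of it applied to a few example strings (the header strings are illustrative, not necessarily the dataset's actual column names):

```python
import re

def canon(s: str) -> str:
    """Canonical column name: lowercase, strip, replace non-alnum runs with '_'."""
    s = s.strip().lower()
    s = re.sub(r"[^a-z0-9]+", "_", s)
    return re.sub(r"_+", "_", s).strip("_")

print(canon("Plant Code "))       # plant_code
print(canon("Daily Capacity "))   # daily_capacity
print(canon("N/A  %% value!!"))   # n_a_value
```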
# Demand
orders = sheets['orderlist'][['order_id', 'weight']]
orders.columns = ["order_id", "demand"]
orders = orders[orders["demand"] > 0].copy()
orders["order_id"] = orders["order_id"].astype(str)
# Retain top 1000 orders by demand
k = 1000
orders = (
orders
.sort_values("demand", ascending=False)
.head(k)
.reset_index(drop=True)
)
print(f"Orders retained: {len(orders)}")
orders.head()
# Total Demand capacity
total_demand = orders['demand'].sum()
print(f"Total orders: {len(orders)}")
print(f"Total demand capacity: {total_demand}")
#orders.head()
Orders retained: 1000 Total orders: 1000 Total demand capacity: 129698.9734391597
# Plant capacities
plants = sheets['whcapacities']
plants.columns = ["plant_id", "supply_cap"]
plants["plant_id"] = plants["plant_id"].astype(str)
plants["supply_cap"] = pd.to_numeric(plants["supply_cap"], errors="coerce")
plants = plants.dropna(subset=["supply_cap"])
plants["supply_cap"] = plants["supply_cap"].astype(float) * 30  # scale capacities (x30) so the retained plants can cover total demand
# Retain top 10 plants by supply capacity
plants = (
plants
.sort_values("supply_cap", ascending=False)
.head(10)
.reset_index(drop=True)
)
# Total Supply capacity
total_supply = plants["supply_cap"].sum()
print(f"Total plants: {len(plants)}")
print(f"Total supply capacity: {total_supply}")
#plants.head()
Total plants: 10 Total supply capacity: 159720.0
# Check Supply is enough to meet Demand
if total_supply < total_demand:
raise ValueError(f"Total supply ({total_supply}) is less than total demand ({total_demand}). Problem is infeasible.")
# ----------------------------
# Build lane table (arc list) and unit costs
# ----------------------------
# NOTE: In this notebook we generate a mostly-dense plant×order lane set
# with dummy unit costs, but randomly skip a fraction (skip_prob, set below)
# of connections to mimic missing/infeasible shipping lanes in real data.
rng = np.random.default_rng(7)
plant_factor = {pid: 0.8 + 0.4 * rng.random() for pid in plants["plant_id"]}
plant_ids = plants["plant_id"].tolist()
skip_prob = 0.5 # probability of skipping a plant-order connection
lanes = []
for _, o in orders.iterrows():
oid = o["order_id"]
dem = float(o["demand"])
order_jitter = 0.75 + 0.25 * rng.random()
for pid in plant_ids:
# randomly skip some connections
if rng.random() < skip_prob:
continue
unit_cost = 1 * plant_factor[pid] * order_jitter
lanes.append((pid, oid, unit_cost))
lanes = pd.DataFrame(lanes, columns=["plant_id", "order_id", "unit_cost"])
print(f"Total lanes (plant-order pairs): {len(lanes)}")
print(f"Total lanes (plant-order pairs) if we had full connectivity: {len(plants) * len(orders)}")
Total lanes (plant-order pairs): 5048 Total lanes (plant-order pairs) if we had full connectivity: 10000
# Drop plants / orders with no lanes
print(
f"Before cleanup: {len(plants)} plants, "
f"{len(orders)} orders, "
f"{len(lanes)} lanes"
)
plants_with_lane = set(lanes["plant_id"].unique())
orders_with_lane = set(lanes["order_id"].unique())
# Identify disconnected nodes
drop_plants = set(plants["plant_id"]) - plants_with_lane
drop_orders = set(orders["order_id"]) - orders_with_lane
if drop_plants:
print(f"Dropping {len(drop_plants)} plant(s) with no lanes: {sorted(drop_plants)}")
plants_with_lanes = plants[~plants["plant_id"].isin(drop_plants)].reset_index(drop=True)
else:
plants_with_lanes = plants.copy()
if drop_orders:
print(f"Dropping {len(drop_orders)} order(s) with no lanes: {sorted(drop_orders)}")
orders_with_lanes = orders[~orders["order_id"].isin(drop_orders)].reset_index(drop=True)
else:
orders_with_lanes = orders.copy()
# Keep lanes consistent with remaining plants and orders
lanes = lanes[
lanes["plant_id"].isin(plants_with_lanes["plant_id"]) &
lanes["order_id"].isin(orders_with_lanes["order_id"])
].reset_index(drop=True)
print(
f"After cleanup: {len(plants_with_lanes)} plants, "
f"{len(orders_with_lanes)} orders, "
f"{len(lanes)} lanes"
)
Before cleanup: 10 plants, 1000 orders, 5048 lanes Dropping 1 order(s) with no lanes: ['1447183486.7'] After cleanup: 10 plants, 999 orders, 5048 lanes
# ----------------------------
# Data integrity checks
# ----------------------------
# Ensure IDs are unique before we rely on index-based alignment later.
if not plants_with_lanes["plant_id"].is_unique:
raise ValueError("plants.plant_id is not unique. Deduplicate or aggregate supplies first.")
if not orders_with_lanes["order_id"].is_unique:
raise ValueError("orders.order_id is not unique. Deduplicate or aggregate demands first.")
# ----------------------------
# Clean / validate lanes
# ----------------------------
# 1) Drop lanes whose plant_id/order_id do not exist in the node tables.
# 2) If duplicate (plant_id, order_id) lanes exist, aggregate them (here: keep the minimum unit cost).
# Keep only lanes that connect to known plants/orders (avoid silent mismatches)
lanes = lanes.merge(plants_with_lanes[["plant_id"]], on="plant_id", how="inner")
lanes = lanes.merge(orders_with_lanes[["order_id"]], on="order_id", how="inner")
if lanes.empty:
raise ValueError("No valid lanes after matching plant_id/order_id against plants/orders.")
lanes = (lanes.groupby(["plant_id", "order_id"], as_index=False).agg({'unit_cost': "min"}))
#lanes.head()
# ----------------------------
# Define a *single source of truth* for indices
# ----------------------------
plant_ids = plants_with_lanes["plant_id"].to_numpy()
order_ids = orders_with_lanes["order_id"].to_numpy()
plant_to_i = {pid: i for i, pid in enumerate(plant_ids)}
order_to_j = {oid: j for j, oid in enumerate(order_ids)}
# Arc index = row index of lanes after reset
lanes = lanes.reset_index(drop=True)
nA = len(lanes)
nP = len(plant_ids)
nO = len(order_ids)
# Build arc endpoint index arrays
arc_i = lanes["plant_id"].map(plant_to_i).to_numpy()
arc_j = lanes["order_id"].map(order_to_j).to_numpy()
arc_i = arc_i.astype(int)
arc_j = arc_j.astype(int)
# ----------------------------
# Build aligned vectors: cost c, supply, demand
# ----------------------------
# IMPORTANT: `c[a]` must correspond to the same arc as decision variable `x[a]`.
# Here, arc index a = row index of `lanes` after reset_index.
# Cost vector corresponds to arc order (lanes row order).
c = lanes['unit_cost'].to_numpy(dtype=float)
supply = plants.set_index("plant_id").loc[plant_ids, 'supply_cap'].to_numpy(dtype=float)
demand = orders.set_index("order_id").loc[order_ids, 'demand'].to_numpy(dtype=float)
# Ensure every order has at least one incoming lane
incoming_counts = np.bincount(arc_j, minlength=nO)
missing_orders = order_ids[incoming_counts == 0]
if len(missing_orders) > 0:
raise ValueError(f"Orders with no incoming lanes: {missing_orders[:10]}{'...' if len(missing_orders) > 10 else ''}")
# ----------------------------
# Visuals: supply capacity by plant & demand by order
# ----------------------------
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(np.arange(len(supply)), supply)
ax.set_title("Supply capacity by plant (index order)")
ax.set_xlabel("Plant index i")
ax.set_ylabel("Supply capacity")
plt.show()
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(np.arange(len(demand)), demand)
ax.set_title("Demand by order (index order)")
ax.set_xlabel("Order index j")
ax.set_ylabel("Demand")
plt.show()
# ----------------------------
# Build sparse incidence matrices
# ----------------------------
# A_order[j, a] = 1 if arc a goes into order j
A_order = sparse.coo_matrix(
(np.ones(nA), (arc_j, np.arange(nA))),
shape=(nO, nA)
).tocsr()
# A_plant[i, a] = 1 if arc a goes out of plant i
A_plant = sparse.coo_matrix(
(np.ones(nA), (arc_i, np.arange(nA))),
shape=(nP, nA)
).tocsr()
A_order
<Compressed Sparse Row sparse matrix of dtype 'float64' with 5048 stored elements and shape (999, 5048)>
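The incidence construction above can be sanity-checked on a tiny hand-built arc list (2 plants, 3 orders, 4 arcs; illustrative only): multiplying each incidence matrix by a vector of ones counts the arcs touching each node, which is exactly how the constraints aggregate flows per plant and per order.

```python
import numpy as np
from scipy import sparse

# Arcs: arc a leaves plant arc_i[a] and enters order arc_j[a]
arc_i = np.array([0, 0, 1, 1])
arc_j = np.array([0, 1, 1, 2])
nA, nP, nO = len(arc_i), 2, 3

A_plant = sparse.coo_matrix((np.ones(nA), (arc_i, np.arange(nA))), shape=(nP, nA)).tocsr()
A_order = sparse.coo_matrix((np.ones(nA), (arc_j, np.arange(nA))), shape=(nO, nA)).tocsr()

# Row sums = number of arcs per node
print(A_plant @ np.ones(nA))  # [2. 2.]  -> each plant has 2 outgoing arcs
print(A_order @ np.ones(nA))  # [1. 2. 1.] -> order 1 has 2 incoming arcs
```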
CVXPY Solver¶
# ----------------------------
# Solve the LP in CVXPY
# ----------------------------
# Decision variable:
# x[a] = shipment quantity on lane (arc) a
x = cp.Variable(nA, nonneg=True)
# Constraints:
# - every order demand must be met exactly
# - plant shipments cannot exceed supply capacity
constraints = [
A_order @ x == demand,
A_plant @ x <= supply
]
# Objective: min total shipping cost
objective = cp.Minimize(c @ x)
prob = cp.Problem(objective, constraints)
# Choose a good LP solver explicitly (fallback if unavailable)
preferred = "HIGHS"
installed = set(cp.installed_solvers())
solver = preferred if preferred in installed else ("GLPK" if "GLPK" in installed else None)
if solver is None:
raise RuntimeError(f"No suitable LP solver found. Installed solvers: {sorted(installed)}")
_ = prob.solve(solver=solver, verbose=False)
print("Status:", prob.status)
print("Objective value:", prob.value)
if prob.status not in ("optimal", "optimal_inaccurate"):
raise RuntimeError(f"Solve failed: status={prob.status}")
# Attach the solution back to the lane table (arc ordering!)
lanes_sol = lanes.copy()
lanes_sol["x"] = np.asarray(x.value).reshape(-1)
# ----------------------------
# Sanity checks (residuals)
# ----------------------------
order_resid = (A_order @ lanes_sol["x"].to_numpy()) - demand
plant_resid = (A_plant @ lanes_sol["x"].to_numpy()) - supply # should be <= 0
diagnostics = {
"status": prob.status,
"objective": float(prob.value),
"max_abs_order_residual": float(np.max(np.abs(order_resid))) if len(order_resid) else 0.0,
"max_plant_violation": float(np.max(plant_resid)) if len(plant_resid) else 0.0,
"solver_used": solver,
"n_plants": int(nP),
"n_orders": int(nO),
"n_lanes": int(nA),
}
diagnostics
Status: optimal Objective value: 113265.87945592123
{'status': 'optimal',
'objective': 113265.87945592123,
'max_abs_order_residual': 2.808064891723916e-11,
'max_plant_violation': 2.546585164964199e-11,
'solver_used': 'GLPK',
'n_plants': 10,
'n_orders': 999,
'n_lanes': 5048}
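A natural follow-up, not computed in the notebook above, is to inspect dual variables: in an LP of this form, the dual of each demand constraint is the marginal cost of one extra unit of demand at that order, and the dual of each supply constraint is the (non-positive) value of one extra unit of capacity. A minimal sketch with SciPy's HiGHS interface, on a tiny illustrative instance where supplies are slack:

```python
import numpy as np
from scipy.optimize import linprog

# Tiny 2-plant / 2-order instance (illustrative numbers)
cost = np.array([[1.0, 3.0],
                 [2.0, 1.0]])
supply = np.array([5.0, 5.0])
demand = np.array([4.0, 4.0])

nP, nO = cost.shape
A_eq = np.zeros((nO, nP * nO))
for j in range(nO):
    A_eq[j, j::nO] = 1.0
A_ub = np.zeros((nP, nP * nO))
for i in range(nP):
    A_ub[i, i * nO:(i + 1) * nO] = 1.0

res = linprog(cost.ravel(), A_ub=A_ub, b_ub=supply,
              A_eq=A_eq, b_eq=demand, method="highs")
# HiGHS exposes duals as sensitivities of the optimal value to the RHS:
print(res.eqlin.marginals)    # shadow price per unit of extra demand at each order
print(res.ineqlin.marginals)  # 0 here, since neither supply limit is binding
```

With CVXPY, the equivalent information would be read off `constraints[k].dual_value` after solving.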
# ----------------------------
# Heatmap of shipped amounts (plants x orders)
# ----------------------------
# lanes_sol must contain: plant_id, order_id, x
# plant_ids, order_ids are the index orders used elsewhere (strings)
# Build shipment matrix: rows=plants, cols=orders
ship = (lanes_sol
.pivot_table(index="plant_id", columns="order_id", values="x", aggfunc="sum", fill_value=0.0)
)
# Ensure full row/col order (so it matches your indexing)
ship = ship.reindex(index=plant_ids, columns=order_ids, fill_value=0.0)
order_totals = ship.sum(axis=0).sort_values(ascending=False)
ship = ship[order_totals.index]
plant_totals = ship.sum(axis=1).sort_values(ascending=False)
#ship = ship.loc[plant_totals.index]
X = ship.to_numpy()
plt.figure(figsize=(18, 5))
im = plt.imshow(
X,
aspect="auto",
cmap="viridis",
norm=PowerNorm(gamma=0.3)
)
plt.title("Shipment heatmap")
plt.xlabel("Orders (Order ID)")
plt.ylabel("Plants (Plant ID)")
plt.yticks(np.arange(len(ship.index)), ship.index)
#step = max(1, len(ship.columns)//50)
step = 20
xt = np.arange(0, len(ship.columns), step)
plt.xticks(xt, ship.columns[xt], rotation=90)
plt.colorbar(im, label="Shipped quantity (power-law scaled)")
plt.tight_layout()
plt.show()
II - Mean–Variance Portfolio Optimization: A Pareto-Optimal Case Study.¶
Modern Portfolio Theory provides a mathematical framework for constructing investment portfolios that balance expected return against risk, where risk is measured by the variance (or standard deviation) of portfolio returns. This framework—commonly referred to as mean–variance analysis—was introduced by Harry Markowitz and forms the foundation of quantitative portfolio construction.
- Mean–Variance Optimization: Select portfolio weights to trade off expected return against return variance.
- Efficient Frontier: The set of portfolios that achieve the maximum expected return for a given level of risk, or equivalently, the minimum risk for a given expected return.
- Pareto Optimality: A portfolio is Pareto-optimal if no other feasible portfolio offers higher return at the same level of risk.
In this case study, we analyze a diversified universe of liquid financial assets using publicly available historical market data.
- Assets consist of widely traded exchange-traded funds (ETFs) spanning multiple asset classes
- Historical price data covers several years
- Expected returns and the covariance matrix are estimated from historical returns
Formulation¶
We seek to determine portfolio weights that optimally balance expected return and risk, subject to realistic investment constraints commonly encountered in practice.
Objective Function¶
Two equivalent formulations are commonly used:
Minimize portfolio variance for a target return: $\min \; w^\top \Sigma w$
Maximize risk-adjusted return: $\max \; \mu^\top w - \lambda \, w^\top \Sigma w$
where:
- $w$ is the vector of portfolio weights
- $\mu$ is the vector of expected asset returns
- $\Sigma$ is the covariance matrix of asset returns
- $\lambda > 0$ is a risk-aversion parameter
Decision Variables¶
- $w_i$: portfolio weight allocated to asset $i$
Parameters¶
- $\mu_i$: expected return of asset $i$
- $\Sigma_{ij}$: covariance between returns of assets $i$ and $j$
- $\lambda$: risk-aversion coefficient
Constraints¶
- Long-only: portfolio weights must be non-negative ($w_i \ge 0$).
- Fully invested: portfolio weights sum to one ($\sum_i w_i = 1$).
- Maximum per-asset allocation: $w_i \le 0.25$ to prevent excessive concentration.
This optimization problem is a quadratic programming (QP) problem:
- The objective function contains a quadratic term $w^\top \Sigma w$
- The covariance matrix $\Sigma$ is positive semidefinite
- All constraints are linear
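Before adding the long-only and concentration constraints, it is instructive to see the one case with a closed form: minimizing $w^\top \Sigma w$ subject only to $\mathbf{1}^\top w = 1$ gives the global minimum-variance portfolio $w^\star = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})$. A quick numerical sketch with a toy 3-asset covariance (illustrative numbers, not estimated from data):

```python
import numpy as np

# Toy symmetric positive-definite covariance matrix (illustrative)
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
ones = np.ones(3)

# Closed-form global minimum-variance weights
z = np.linalg.solve(Sigma, ones)
w_star = z / (ones @ z)

var_star = w_star @ Sigma @ w_star
var_eq = (ones / 3) @ Sigma @ (ones / 3)  # equal-weight benchmark
print(w_star, var_star, var_eq)
# KKT condition: Sigma w* is a constant vector (= var_star * 1)
```

Once the inequality constraints $w \ge 0$ and $w \le 0.25$ are added, no closed form exists, which is why the case study uses a QP solver (CVXPY) below.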
#!pip install pandas_datareader
# Install and import necessary libraries
#!pip install yfinance cvxpy pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import cvxpy as cp
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_datareader import data as pdr
Data Acquisition and Cleaning¶
We begin by selecting a representative universe of financial assets and obtaining their historical price data from publicly available sources. To ensure reproducibility and avoid reliance on proprietary services or API keys, we use historical market data from Stooq, a freely accessible financial data repository.
The asset universe consists of liquid ETFs spanning multiple asset classes, such as U.S. equities, sector-specific equities, government bonds, and commodities. Historical daily closing prices are retrieved over a multi-year horizon to capture different market conditions and provide sufficient data for robust estimation of expected returns and risk.
tickers_dict = {
# Broad Equity
"SPY.US": "U.S. broad-market equity (S&P 500)",
"VTI.US": "Total U.S. stock market",
# Sector-Specific Equities
"XLK.US": "Technology sector",
"XLF.US": "Financials sector",
"XLE.US": "Energy sector",
"XLV.US": "Healthcare sector",
"XLY.US": "Consumer Discretionary sector",
"XLI.US": "Industrials sector",
"XLRE.US": "Real Estate sector",
# Size / Style
"IWM.US": "Russell 2000 ETF",
"QQQ.US": "NASDAQ 100 ETF (tech-heavy)",
# International Equity
"EFA.US": "Developed markets (ex-US)",
"EEM.US": "Emerging markets",
"VXUS.US": "Total International equity",
# Government Bonds
"IEF.US": "U.S. Treasuries 7–10Y",
"TLT.US": "U.S. Treasuries 20+Y",
"SHY.US": "Short-term Treasuries",
# Corporate / High Yield
"LQD.US": "Investment-grade corporate bonds",
"HYG.US": "High-yield corporate bonds",
# Commodities
"GLD.US": "Gold",
"SLV.US": "Silver",
"DBC.US": "Broad commodities",
"USO.US": "Crude Oil",
"UNG.US": "Natural Gas",
# Alternative / Real Assets
"VNQ.US": "U.S. Real Estate (REITs)",
"ICLN.US": "Global Clean Energy"
# Cryptocurrency
#"BTC-USD": "Bitcoin (USD-denominated)"
}
tickers = list(tickers_dict.keys())
print(len(tickers))
26
# Define asset tickers and download data (3 years of daily data)
start_date = "2023-01-01"
end_date = "2025-12-20"
prices = pd.DataFrame()
for ticker in tickers:
print(f"Downloading data for {ticker}...")
data = pdr.DataReader(ticker, "stooq", start_date, end_date)
prices[ticker] = data["Close"]
prices = prices.sort_index()
Downloading data for SPY.US... Downloading data for VTI.US... Downloading data for XLK.US... Downloading data for XLF.US... Downloading data for XLE.US... Downloading data for XLV.US... Downloading data for XLY.US... Downloading data for XLI.US... Downloading data for XLRE.US... Downloading data for IWM.US... Downloading data for QQQ.US... Downloading data for EFA.US... Downloading data for EEM.US... Downloading data for VXUS.US... Downloading data for IEF.US... Downloading data for TLT.US... Downloading data for SHY.US... Downloading data for LQD.US... Downloading data for HYG.US... Downloading data for GLD.US... Downloading data for SLV.US... Downloading data for DBC.US... Downloading data for USO.US... Downloading data for UNG.US... Downloading data for VNQ.US... Downloading data for ICLN.US...
print("Missing values per ticker:")
#print(prices.isna().sum())
# Drop rows with any missing values
prices = prices.dropna()
Missing values per ticker:
plt.figure(figsize=(28, 14))
for ticker in prices.columns:
plt.plot(prices.index, prices[ticker], label=tickers_dict[ticker])
plt.title("Asset Prices (Daily Close)")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.grid(True)
plt.show()
## Returns & Risk Estimation
# Daily returns
returns = prices.pct_change().dropna()
trading_days_per_year = len(returns) / ((returns.index[-1] - returns.index[0]).days / 365)
# Annualized statistics
mu = returns.mean() * trading_days_per_year
Sigma = returns.cov() * trading_days_per_year
# --- Weekly returns ---
weekly_returns = prices.resample('W').last().pct_change().dropna()
trading_weeks_per_year = 52 # standard
mu_weekly = weekly_returns.mean() * trading_weeks_per_year
# --- Monthly returns ---
monthly_returns = prices.resample('ME').last().pct_change().dropna()
trading_months_per_year = 12
mu_monthly = monthly_returns.mean() * trading_months_per_year
# --- Combine into DataFrame ---
mu_df = pd.DataFrame({
'Asset': [tickers_dict[t] for t in prices.columns],
'Annualized Daily Return': mu.values,
'Annualized Weekly Return': mu_weekly.values,
'Annualized Monthly Return': mu_monthly.values
})
# Sort by Annualized Daily Return descending
mu_df = mu_df.sort_values(by='Annualized Monthly Return', ascending=False).reset_index(drop=True)
mu_df
| | Asset | Annualized Daily Return | Annualized Weekly Return | Annualized Monthly Return |
|---|---|---|---|---|
| 0 | Silver | 0.384067 | 0.387958 | 0.393563 |
| 1 | Gold | 0.299218 | 0.292404 | 0.286397 |
| 2 | NASDAQ 100 ETF (tech-heavy) | 0.311032 | 0.304409 | 0.271663 |
| 3 | U.S. broad-market equity (S&P 500) | 0.218446 | 0.210848 | 0.195445 |
| 4 | Total U.S. stock market | 0.204311 | 0.196916 | 0.179124 |
| 5 | Industrials sector | 0.168020 | 0.159802 | 0.157793 |
| 6 | Financials sector | 0.172198 | 0.162923 | 0.152073 |
| 7 | Emerging markets | 0.145270 | 0.128816 | 0.115537 |
| 8 | Russell 2000 ETF | 0.147370 | 0.137934 | 0.111278 |
| 9 | Developed markets (ex-US) | 0.134415 | 0.125603 | 0.105827 |
| 10 | Total International equity | 0.131797 | 0.121534 | 0.103827 |
| 11 | Technology sector | 0.144699 | 0.137155 | 0.103551 |
| 12 | Healthcare sector | 0.054585 | 0.054160 | 0.060555 |
| 13 | High-yield corporate bonds | 0.030829 | 0.022513 | 0.018823 |
| 14 | Crude Oil | 0.045700 | 0.060544 | 0.018543 |
| 15 | Real Estate sector | 0.044486 | 0.036616 | 0.011133 |
| 16 | Consumer Discretionary sector | 0.068886 | 0.055095 | 0.009095 |
| 17 | Short-term Treasuries | 0.006527 | 0.005279 | 0.004174 |
| 18 | U.S. Real Estate (REITs) | 0.039841 | 0.031769 | 0.003941 |
| 19 | Investment-grade corporate bonds | 0.015634 | 0.007656 | 0.001178 |
| 20 | U.S. Treasuries 7–10Y | 0.001794 | -0.004978 | -0.007952 |
| 21 | Broad commodities | -0.005557 | 0.000307 | -0.022092 |
| 22 | U.S. Treasuries 20+Y | -0.008399 | -0.021982 | -0.029124 |
| 23 | Global Clean Energy | -0.030941 | -0.036411 | -0.050184 |
| 24 | Energy sector | -0.129084 | -0.142963 | -0.158374 |
| 25 | Natural Gas | -0.312402 | -0.285856 | -0.263702 |
import seaborn as sns
import matplotlib.pyplot as plt
# Compute correlation matrix
corr = returns.corr()
# Rename rows and columns using full asset names
corr_named = corr.rename(index=tickers_dict, columns=tickers_dict)
g = sns.clustermap(
corr_named,
cmap="magma",
figsize=(16, 14),
linewidths=0.5,
cbar_kws={'label': 'Correlation'},
method='average',
metric='correlation')
g.fig.text(
0.5, # x-position (center)
0.875, # y-position (near top)
"Correlation Matrix with Clustered Ordering (Dendrogram Hidden)",
ha='center',
fontsize=16
)
plt.show()
Mathematical Formulation¶
Let:
- $\mathbf{w} = [w_1, w_2, \dots, w_n]^\top$ be the portfolio weights,
- $\mathbf{r} = [r_1, r_2, \dots, r_n]^\top$ be the expected returns,
- $\mathbf{C}$ be the covariance matrix of asset returns.
The portfolio expected return and variance are:
$$ r_p = \mathbf{w}^\top \mathbf{r}, \qquad \sigma_p^2 = \mathbf{w}^\top \mathbf{C} \mathbf{w}. $$
Objectives¶
We aim to maximize return and minimize risk simultaneously. Define:
$$ f_1(\mathbf{w}) = -\mathbf{r}^\top \mathbf{w} \quad \text{(maximize return)}, \qquad f_2(\mathbf{w}) = \mathbf{w}^\top \mathbf{C} \mathbf{w} \quad \text{(minimize risk)}. $$
A weighted scalarization of the two objectives allows us to trace the efficient frontier:
$$ \min_{\mathbf{w}} \ -\alpha \, \mathbf{r}^\top \mathbf{w} + (1-\alpha) \, \mathbf{w}^\top \mathbf{C} \mathbf{w}, \quad 0 \le \alpha \le 1. $$
- $\alpha = 0$ → minimize risk only.
- $\alpha = 1$ → maximize return only.
- Varying $\alpha$ from 0 to 1 produces the Pareto-optimal frontier.
Constraints¶
$$ \sum_{i=1}^n w_i = 1, \quad 0 \le w_i \le 0.25, \quad \forall i. $$
By sweeping the trade-off parameter $\alpha$ in the weighted scalarization problem, we can compute a set of optimal portfolios that balance return and risk under realistic investment constraints. This approach provides both practical portfolio recommendations and a visualization of the efficient frontier, aiding investment decision-making in multi-asset portfolios.
## CVXPY Implementation
n = len(tickers)
w = cp.Variable(n)
# Weighted scalarization function
def solve_portfolio(alpha):
objective = cp.Minimize(-alpha * mu.values @ w + (1 - alpha) * cp.quad_form(w, Sigma.values))
constraints = [cp.sum(w) == 1, w >= 0, w <= 0.25]
prob = cp.Problem(objective, constraints)
prob.solve()
return w.value
alphas = np.linspace(0, 1, 100)
frontier_returns = []
frontier_risks = []
weights_list = []
for alpha in alphas:
weights = solve_portfolio(alpha)
weights_list.append(weights.round(4))
r_p = mu.values @ weights
sigma_p = np.sqrt(weights.T @ Sigma.values @ weights)
frontier_returns.append(r_p)
frontier_risks.append(sigma_p)
# Convert weights to DataFrame
weights_df = pd.DataFrame(weights_list, columns=list(tickers_dict.keys()))
portfolio_index = np.arange(len(frontier_risks)) + 1
num_assets = len(tickers_dict)
# Colors: first 20 solid, next 10 lighter
base_colors = plt.get_cmap('tab20').colors
extra_colors = plt.get_cmap('tab20b').colors
colors = list(base_colors) + list(extra_colors)
colors = colors[:num_assets]
# Hatches: first 20 none, rest diagonal '/'
if num_assets <= 10:
hatches = [""]*num_assets
elif num_assets <= 20:
hatches = [""]*10 + ["/"]*(num_assets-10)
else:
hatches = [""]*10 + ["/"]*(10) + ["\\"]*(num_assets-20)
# Create figure with 2 subplots
fig, axes = plt.subplots(2, 1, figsize=(22, 16))
# Top plot: Efficient Frontier
axes[0].plot(frontier_risks, frontier_returns, marker='o', color='b', linewidth=2)
axes[0].set_xlabel("Portfolio Risk (Std Dev)", fontsize=12)
axes[0].set_ylabel("Portfolio Return", fontsize=12)
axes[0].set_title("Efficient Frontier: Risk vs Return", fontsize=14)
axes[0].grid(True)
# Bottom plot: Asset Weights along Frontier with hatches
bottom = np.zeros(len(portfolio_index))
for i, col in enumerate(weights_df.columns):
axes[1].fill_between(portfolio_index,
bottom,
bottom + weights_df[col].values,
color=colors[i],
hatch=hatches[i],
edgecolor='k',
linewidth=0.5,
label=tickers_dict[col])
bottom += weights_df[col].values
# Overlay scaled risk and return
axes[1].plot(portfolio_index, np.array(frontier_risks)/max(frontier_risks), color='k', linestyle='--', linewidth=2, label='Risk (scaled)')
axes[1].plot(portfolio_index, np.array(frontier_returns)/max(frontier_returns), color='r', linestyle='-', linewidth=2, label='Return (scaled)')
axes[1].set_xlabel("Portfolio along Efficient Frontier", fontsize=12)
axes[1].set_ylabel("Asset Weights / Scaled Risk & Return", fontsize=12)
axes[1].set_title("Asset Allocation Along Efficient Frontier", fontsize=14)
axes[1].legend(loc='upper left', bbox_to_anchor=(1,1), fontsize=10)
axes[1].grid(True)
plt.tight_layout()
plt.show()
III - Vehicle Routing: A Metaheuristic Case Study.¶
Problem Definition¶
We consider a real-world logistics challenge: the Vehicle Routing Problem (VRP). In the VRP, a fleet of vehicles must deliver goods from a depot to a set of geographically dispersed customers.
- Travelling Salesman Problem (TSP): Given a single vehicle, find the minimum-cost tour that visits each customer exactly once and returns to the depot.
- Vehicle Routing Problem (VRP): A generalization of the TSP in which multiple vehicles are available; customers must be partitioned among vehicles, and each vehicle performs a route starting and ending at the depot, subject to constraints such as vehicle capacity.
- Vehicle Routing Problem with Time Windows (VRPTW): An extension of the VRP where each customer must be serviced within a specified time interval, and vehicle routes must respect both capacity constraints and time-window feasibility.
Formulation¶
We know:
- The demand (quantity required) of each customer
- The travel cost (or distance/time) between every pair of locations (customers and depot)
- The number of available vehicles
- The capacity of each vehicle
- A time window for each customer $i$, denoted by $[a_i, b_i]$
We must determine which customers are assigned to each vehicle and the order in which they are served, so as to minimize the total routing cost while satisfying all capacity and time-window constraints.
Objective Function¶
- Minimize total travel cost: $\min \sum_{i,j} c_{ij} \sum_k x_{ijk}$
Decision Variables¶
- $x_{ijk} = 1$ if vehicle $k$ travels directly from $i$ to $j$, $0$ otherwise
- $T_{ik}$ = service start time of vehicle $k$ at customer $i$
Parameters¶
- $c_{ij}$ = travel cost from $i$ to $j$
- $t_{ij}$ = travel time from $i$ to $j$
- $q_i$ = demand of customer $i$
- $Q_k$ = capacity of vehicle $k$
- $[a_i, b_i]$ = time window of customer $i$
- $s_i$ = service time at customer $i$ (assumed 0 in this case-study)
- $M$ = sufficiently large constant
Constraints¶
- Exactly one vehicle enters each customer (except depot): $\sum_i \sum_k x_{ijk} = 1 \;\forall j \neq \text{depot}$
- Exactly one vehicle leaves each customer (except depot): $\sum_j \sum_k x_{ijk} = 1 \;\forall i \neq \text{depot}$
- Flow conservation (same vehicle enters and leaves): $\sum_i x_{ihk} - \sum_j x_{hjk} = 0 \;\forall h,k$
- Vehicle capacity: $\sum_i q_i \sum_j x_{ijk} \le Q_k \;\forall k$
- Subtour elimination (prevent disconnected cycles that do not include the depot): $\sum_{i,j \in S} x_{ijk} \le |S| - 1 \;\forall S \subseteq N \setminus \{\text{depot}\},\; |S| \ge 2,\; \forall k$
- Service must start within the allowed time window: $a_i \le T_{ik} \le b_i \;\forall i \neq \text{depot},\forall k$
- Time consistency along routes: $T_{jk} \ge T_{ik} + s_i + t_{ij} - M(1 - x_{ijk}) \;\forall i,j,k$
M is a large constant used in Big-M constraints to turn constraints on or off depending on a binary decision variable.
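To see how the Big-M constraint switches on and off, consider the two cases of the binary variable (a standard Big-M argument):

```latex
\begin{aligned}
x_{ijk} = 1:&\quad T_{jk} \ \ge\ T_{ik} + s_i + t_{ij} && \text{(time propagates along the used arc)}\\
x_{ijk} = 0:&\quad T_{jk} \ \ge\ T_{ik} + s_i + t_{ij} - M && \text{(vacuous for sufficiently large } M\text{)}
\end{aligned}
```

A commonly used per-arc choice is $M_{ij} = \max(0,\; b_i + s_i + t_{ij} - a_j)$, which suffices to deactivate the constraint and gives a tighter LP relaxation than a single global constant.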
This problem is combinatorial and non-convex: it generalizes the Traveling Salesman Problem and is therefore NP-hard (i.e., worst-case solution time grows exponentially with problem size).
Binary routing variables: The decision variables $x_{ijk} \in \{0,1\}$ make the Vehicle Routing Problem with Time Windows (VRPTW) a mixed-integer optimization problem. Each variable indicates whether vehicle $k$ travels directly from customer $i$ to customer $j$. There is no concept of a fractional route or “half a vehicle,” unlike continuous flow variables in linear programming.
Exponential combinations: Assigning $n$ customers to $m$ vehicles and determining the service sequence for each vehicle leads to a factorial / exponential growth in the number of possible solutions. Even for moderately sized instances, the number of feasible routing combinations is astronomically large.
This combinatorial explosion is the fundamental reason why exact solution methods scale poorly and why heuristic and metaheuristic approaches are commonly used in practice.
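To make this growth concrete: a single vehicle visiting $n$ customers admits $n!/2$ distinct tours (fixing the depot as the start and identifying each tour with its reverse). `num_tsp_tours` is a small illustrative helper, not part of the case study:

```python
import math

def num_tsp_tours(n_customers: int) -> int:
    """Distinct depot-anchored tours over n customers,
    identifying each tour with its reverse."""
    if n_customers < 3:
        return 1
    return math.factorial(n_customers) // 2

print(num_tsp_tours(5))   # 60
print(num_tsp_tours(19))  # ~6.1e16 for the 19 customers sampled below
```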
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
import time
#!pip install ortools
#!pip install python-tsp
from python_tsp.exact import solve_tsp_dynamic_programming
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
Data Source¶
For our case-study we use a benchmark VRPTW dataset based on the classical Solomon instances. Specifically, we utilize the extended Multi-Time-Window VRP dataset recently published on Zenodo, which augments Solomon’s problems with multiple time windows for each customer. This provides realistic location coordinates, demands, and time-window constraints for a multi-vehicle delivery scenario.
df_raw = pd.read_csv('data/multiple time window VRP dataset/multiple time window VRP dataset/data/solomon/C101_MTW.csv')
# randomly select k customers along with the depot (CUST_NO 0)
k = 19
df = df_raw.sample(n=k, random_state=1).reset_index(drop=True)
df = pd.concat([pd.DataFrame([df_raw.iloc[0]]), df]).reset_index(drop=True)
print(len(df))
#df.head()
#print(df.columns)
20
depot = df[df['CUST_NO'] == 0]
customers = df[df['CUST_NO'] != 0]
# Create the plot
plt.figure(figsize=(8, 6))
# Plot depot with red square
plt.scatter(depot['XCOORD'], depot['YCOORD'], color='red', marker='s', s=100, label='Depot (Customer 0)')
# Plot other customers with blue circles
plt.scatter(customers['XCOORD'], customers['YCOORD'], color='blue', marker='o', s=60, label='Customers')
# Annotate all points with their customer number
for _, row in df.iterrows():
plt.text(row['XCOORD'] + 0.5, row['YCOORD'] + 0.5, str(row['CUST_NO']), fontsize=9)
# Set plot labels and title
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Customer Locations with Depot')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# sort customers
df_plot = df.sort_values("CUST_NO")
# remove 0 customer number
df_plot = df_plot[df_plot["CUST_NO"] != 0]
# y positions
y_pos = range(len(df_plot))
# time window lengths
window_length = df_plot["DUE_TIME_1"] - df_plot["READY_TIME_1"]
plt.figure(figsize=(10, 6))
# horizontal bars
plt.barh(
y_pos,
window_length,
left=df_plot["READY_TIME_1"],
alpha=0.8
)
# y-axis labels
plt.yticks(y_pos, df_plot["CUST_NO"])
# labels and title
plt.xlabel("Time (minutes)")
plt.ylabel("Customer")
plt.title("Customer Time Windows")
plt.grid(axis="x", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
depot_index = int(df.index[df['CUST_NO'] == 0][0])
# Compute Euclidean distance matrix
coords_array = df[['XCOORD','YCOORD']].to_numpy()
dist_matrix = np.sqrt(((coords_array[:,None,:] - coords_array[None,:,:])**2).sum(axis=2))
scale = 1
scaled_dist_matrix = np.rint(dist_matrix * scale).astype(int)
Scenario 1 - Traveling Salesman Problem¶
In this scenario, a single vehicle starts from a depot, visits all customers exactly once, and returns to the depot. The objective is to minimize the total travel distance.
- One vehicle
- No capacity constraints
- No time window constraints
## Data Model
data = {}
data['distance_matrix_scaled'] = scaled_dist_matrix
data['num_vehicles'] = 1
data['depot'] = depot_index
Option 1 - Exact Solution¶
We will first approach this problem using an exact dynamic programming algorithm (Held–Karp). The method guarantees optimality, but it is slow because it must enumerate all $2^n$ subsets of customers, leading to $O(n^2 2^n)$ time complexity.
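For intuition, the Held–Karp recursion (the same algorithm implemented by `solve_tsp_dynamic_programming`) can be sketched in pure Python for tiny instances. `held_karp` below is an illustrative sketch, not the library code:

```python
from itertools import combinations

def held_karp(D):
    """Exact TSP tour length via dynamic programming over subsets.
    D is an n x n distance matrix; node 0 is the depot. O(n^2 * 2^n)."""
    n = len(D)
    # C[(S, j)] = cheapest path from 0 visiting exactly the nodes in S, ending at j
    C = {(frozenset([j]), j): D[0][j] for j in range(1, n)}
    for size in range(2, n):
        for S in combinations(range(1, n), size):
            Sf = frozenset(S)
            for j in S:
                C[(Sf, j)] = min(C[(Sf - {j}, k)] + D[k][j] for k in S if k != j)
    full = frozenset(range(1, n))
    # close the tour back at the depot
    return min(C[(full, j)] + D[j][0] for j in range(1, n))

D = [[0, 2, 9, 10],
     [1, 0, 6, 4],
     [15, 7, 0, 8],
     [6, 3, 12, 0]]
print(held_karp(D))  # 21
```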
D = np.array(scaled_dist_matrix)
# Reorder so the depot becomes node 0 (python-tsp assumes start at 0)
n = D.shape[0]
perm = [depot_index] + [i for i in range(n) if i != depot_index]
D2 = D[np.ix_(perm, perm)]
# Exact TSP (cycle). Returns permutation starting at 0 and the optimal tour length.
t_start = time.perf_counter()
# Exact TSP (Held–Karp)
_, total_distance = solve_tsp_dynamic_programming(D2)
t_end = time.perf_counter()
print(f"Exact TSP (Held–Karp) took {t_end - t_start:.4f} seconds")
print("Optimal total distance:", total_distance)
Exact TSP (Held–Karp) took 82.1020 seconds Optimal total distance: 305
Option 2 – Metaheuristic Solution¶
To overcome the computational limitations of exact methods, we employ a metaheuristic approach using Google OR-Tools. The Traveling Salesman Problem is modeled as a weighted graph, and a feasible tour is first constructed using a greedy heuristic (Path Cheapest Arc), which incrementally selects the lowest-cost admissible edges. This initial solution is then refined through local search techniques that explore neighboring solutions to reduce total travel cost. While this approach does not guarantee optimality, it efficiently produces high-quality, near-optimal solutions and scales well to large problem instances where exact optimization is impractical.
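OR-Tools' actual construction heuristic operates on the solver's internal graph; as a rough illustration of the cheapest-extension idea behind Path Cheapest Arc, here is a nearest-neighbour sketch (`greedy_tour` is a hypothetical helper, not OR-Tools' implementation):

```python
def greedy_tour(D, start=0):
    """Build a tour by repeatedly extending to the cheapest unvisited node."""
    n = len(D)
    tour, visited = [start], {start}
    while len(visited) < n:
        last = tour[-1]
        nxt = min((j for j in range(n) if j not in visited),
                  key=lambda j: D[last][j])
        tour.append(nxt)
        visited.add(nxt)
    tour.append(start)  # return to the depot
    cost = sum(D[a][b] for a, b in zip(tour[:-1], tour[1:]))
    return tour, cost

D = [[0, 1, 4, 3],
     [1, 0, 2, 5],
     [4, 2, 0, 6],
     [3, 5, 6, 0]]
tour, cost = greedy_tour(D)
print(tour, cost)  # [0, 1, 2, 3, 0] 12
```

A local search phase would then attempt moves such as 2-opt edge exchanges on this initial tour to reduce its cost.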
# Create the routing model
# The pywrapcp.RoutingIndexManager is responsible for translating between external
# node indices (which you use in your problem definition) and internal solver indices.
# The inputs to RoutingIndexManager are:
# - The number of rows of the distance matrix, which is the number of locations (including the depot).
# - The number of vehicles in the problem.
# - The node corresponding to the depot.
manager = pywrapcp.RoutingIndexManager(
len(data["distance_matrix_scaled"]), data["num_vehicles"], data["depot"]
)
routing = pywrapcp.RoutingModel(manager)
# Create the distance callback
# To use the routing solver, you need to create a distance (or transit) callback:
# a function that takes any pair of locations and returns the distance between them.
# The easiest way to do this is using the distance matrix.
def distance_callback(from_index, to_index):
"""Returns the distance between the two nodes."""
# Convert from routing variable Index to distance matrix NodeIndex.
from_node = manager.IndexToNode(from_index)
to_node = manager.IndexToNode(to_index)
return data["distance_matrix_scaled"][from_node][to_node]
transit_callback_index = routing.RegisterTransitCallback(distance_callback)
# Set the cost of travel
# The arc cost evaluator tells the solver how to calculate the cost of travel
# between any two locations — in other words, the cost of the edge (or arc)
# joining them in the graph for the problem. The following code sets the arc cost
# evaluator.
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
# Set search parameters
# The code sets the first solution strategy to PATH_CHEAPEST_ARC,
# which creates an initial route for the solver by repeatedly adding edges
# with the least weight that don't lead to a previously visited node
# (other than the depot).
search_parameters = pywrapcp.DefaultRoutingSearchParameters()
search_parameters.first_solution_strategy = (
routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
)
# Solution Printer
def print_solution(manager, routing, solution):
"""Prints solution on console."""
print(f"Objective: {solution.ObjectiveValue()} miles")
index = routing.Start(0)
plan_output = "Route for vehicle 0:\n"
route_distance = 0
while not routing.IsEnd(index):
plan_output += f" {manager.IndexToNode(index)} ->"
previous_index = index
index = solution.Value(routing.NextVar(index))
route_distance += routing.GetArcCostForVehicle(previous_index, index, 0)
plan_output += f" {manager.IndexToNode(index)}\n"
plan_output += f"Route distance: {route_distance} miles\n"
print(plan_output)
t_start = time.perf_counter()
solution = routing.SolveWithParameters(search_parameters)
t_end = time.perf_counter()
print(f"OR Tools took {t_end - t_start:.4f} seconds")
if solution:
print_solution(manager, routing, solution)
OR Tools took 0.0358 seconds Objective: 305 miles Route for vehicle 0: 0 -> 15 -> 17 -> 7 -> 12 -> 10 -> 13 -> 16 -> 5 -> 3 -> 4 -> 14 -> 1 -> 11 -> 6 -> 8 -> 9 -> 18 -> 2 -> 19 -> 0 Route distance: 305 miles
def plot_tsp_route_ortools(df, manager, routing, solution,
vehicle_id=0,
distance_matrix=None, # e.g., data['distance_matrix_scaled'] or data['distance_matrix']
scale=1, # e.g., 1000 if using scaled costs, else 1
x_col="XCOORD", y_col="YCOORD", node_col="CUST_NO",
show_labels=True):
"""
One-stop function:
1) Extracts the OR-Tools route for `vehicle_id`
2) Plots points + arrows
3) Computes total route distance using your EXISTING distance matrix
(so it matches what OR-Tools optimized)
Notes:
- Assumes IndexToNode outputs indices aligned with df row order (df reset_index recommended).
- distance_matrix should be the same one you used for costs, typically scaled.
"""
if solution is None:
raise ValueError("No solution found (solution is None).")
if distance_matrix is None:
raise ValueError("Please pass distance_matrix (e.g., data['distance_matrix_scaled']).")
xs = df[x_col].to_numpy()
ys = df[y_col].to_numpy()
# --- Extract route nodes (including end depot) ---
index = routing.Start(vehicle_id)
route = [manager.IndexToNode(index)]
while not routing.IsEnd(index):
index = solution.Value(routing.NextVar(index))
route.append(manager.IndexToNode(index))
# --- Compute total distance from existing matrix ---
total_cost = 0
for a, b in zip(route[:-1], route[1:]):
total_cost += int(distance_matrix[a][b])
total_distance = total_cost / scale
# --- Plot ---
plt.figure(figsize=(8, 6))
plt.scatter(xs, ys)
# depot highlight (assumes depot is route[0])
depot_node = route[0]
plt.scatter([xs[depot_node]], [ys[depot_node]], s=140, marker="s", label="Depot")
# labels
if show_labels:
labels = df[node_col].to_numpy() if node_col in df.columns else np.arange(len(df))
for i in range(len(df)):
plt.text(xs[i], ys[i], str(labels[i]), fontsize=9, ha="left", va="bottom")
# arrows
for a, b in zip(route[:-1], route[1:]):
plt.annotate(
"",
xy=(xs[b], ys[b]),
xytext=(xs[a], ys[a]),
arrowprops=dict(arrowstyle="->", lw=1),
)
plt.title(f"TSP (vehicle {vehicle_id}) | Total distance = {total_distance:.2f}")
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.axis("equal")
plt.grid(True)
plt.legend()
plt.show()
return route, total_distance
Scenario 2 – Capacitated Vehicle Routing Problem¶
In this scenario, there are K vehicles available at a central depot. Each vehicle has a maximum load capacity, and customer demands must be delivered without exceeding that capacity. The objective is to design a set of routes that minimizes total travel cost (or distance) while ensuring all customer demands are served.
- K vehicles
- Capacity constraints (total demand on each route must not exceed vehicle capacity)
- No time window constraints on customer visits
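The capacity constraint from the formulation reduces to a per-route load check; `route_load_ok` below is an illustrative helper, independent of the solver:

```python
def route_load_ok(route, demand, capacity):
    """Total demand on a route (excluding the depot, node 0) must fit the vehicle."""
    return sum(demand[i] for i in route if i != 0) <= capacity

demand = {0: 0, 1: 30, 2: 40, 3: 50}
print(route_load_ok([0, 1, 2, 0], demand, 80))  # 70 <= 80 -> True
print(route_load_ok([0, 2, 3, 0], demand, 80))  # 90 >  80 -> False
```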
## Data Model
K = 4
vehicle_capacity = int(1.2*np.ceil(sum(df['DEMAND']) / K))
print(vehicle_capacity)
data = {}
data['distance_matrix_scaled'] = scaled_dist_matrix
data['num_vehicles'] = K
data['demands'] = df['DEMAND'].tolist()
data['vehicle_capacities'] = [vehicle_capacity]*K
data['depot'] = depot_index
105
# Create the routing model
manager = pywrapcp.RoutingIndexManager(
len(data["distance_matrix_scaled"]), data["num_vehicles"], data["depot"]
)
routing = pywrapcp.RoutingModel(manager)
# Create the distance callback
def distance_callback(from_index, to_index):
"""Returns the distance between the two nodes."""
# Convert from routing variable Index to distance matrix NodeIndex.
from_node = manager.IndexToNode(from_index)
to_node = manager.IndexToNode(to_index)
return int(data["distance_matrix_scaled"][from_node][to_node])
transit_callback_index = routing.RegisterTransitCallback(distance_callback)
# Define cost of each arc.
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
# Add Capacity constraint.
def demand_callback(from_index):
"""Returns the demand of the node."""
# Convert from routing variable Index to demands NodeIndex.
from_node = manager.IndexToNode(from_index)
return int(data["demands"][from_node])
demand_callback_index = routing.RegisterUnaryTransitCallback(demand_callback)
routing.AddDimensionWithVehicleCapacity(
demand_callback_index,
0, # null capacity slack
data["vehicle_capacities"], # vehicle maximum capacities
True, # start cumul to zero
"Capacity",
)
# Set search parameters
# Setting first solution heuristic.
search_parameters = pywrapcp.DefaultRoutingSearchParameters()
search_parameters.first_solution_strategy = (
routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
)
search_parameters.local_search_metaheuristic = (
routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
)
search_parameters.time_limit.FromSeconds(1)
solution = routing.SolveWithParameters(search_parameters)
def plot_cvrp_routes_ortools(df, manager, routing, solution,
distance_matrix=None,
scale=1,
x_col="XCOORD", y_col="YCOORD", node_col="CUST_NO",
show_labels=True,
show_arrows=True):
"""
CVRP visualizer with:
- per-vehicle colored routes
- per-vehicle distance + load in legend
- total distance + total load in title
"""
if solution is None:
raise ValueError("No solution found.")
if distance_matrix is None:
raise ValueError("Please pass distance_matrix.")
xs = df[x_col].to_numpy()
ys = df[y_col].to_numpy()
labels = None
if show_labels:
labels = df[node_col].to_numpy() if node_col in df.columns else np.arange(len(df))
depot_node = manager.IndexToNode(routing.Start(0))
# Capacity dimension (if present)
try:
cap_dim = routing.GetDimensionOrDie("Capacity")
except Exception:
cap_dim = None
routes = {}
per_vehicle_distance = {}
per_vehicle_load = {}
total_cost = 0
total_load = 0
# --- Extract routes, distance, load ---
for v in range(routing.vehicles()):
start_index = routing.Start(v)
if solution.Value(routing.NextVar(start_index)) == routing.End(v):
continue
index = start_index
route = [manager.IndexToNode(index)]
route_cost = 0
while not routing.IsEnd(index):
node = manager.IndexToNode(index)
next_index = solution.Value(routing.NextVar(index))
next_node = manager.IndexToNode(next_index)
route.append(next_node)
route_cost += int(distance_matrix[node][next_node])
index = next_index
routes[v] = route
per_vehicle_distance[v] = route_cost / scale
total_cost += route_cost
if cap_dim is not None:
load = solution.Value(cap_dim.CumulVar(routing.End(v)))
else:
load = 0
per_vehicle_load[v] = load
total_load += load
total_distance = total_cost / scale
# --- Plot base points ---
plt.figure(figsize=(9, 7))
plt.scatter(xs, ys, label="Customers")
plt.scatter([xs[depot_node]], [ys[depot_node]],
s=180, marker="s", label="Depot")
if show_labels:
for i in range(len(df)):
plt.text(xs[i], ys[i], str(labels[i]),
fontsize=9, ha="left", va="bottom")
# --- Colors ---
used_vehicles = list(routes.keys())
cmap = plt.get_cmap("tab10") if len(used_vehicles) <= 10 else plt.get_cmap("tab20")
# --- Draw routes ---
for idx, v in enumerate(used_vehicles):
route = routes[v]
color = cmap(idx % cmap.N)
for a, b in zip(route[:-1], route[1:]):
x1, y1 = xs[a], ys[a]
x2, y2 = xs[b], ys[b]
plt.plot([x1, x2], [y1, y2], color=color, lw=1.8)
if show_arrows:
plt.annotate(
"",
xy=(x2, y2),
xytext=(x1, y1),
arrowprops=dict(arrowstyle="->", lw=1.8, color=color),
)
# legend entry
plt.plot(
[],
[],
color=color,
lw=3,
label=f"Vehicle {v} | dist={per_vehicle_distance[v]:.2f}, load={per_vehicle_load[v]}"
)
plt.title(
f"CVRP Routes | Vehicles used = {len(routes)} | "
f"Total distance = {total_distance:.2f} | Total load = {total_load}"
)
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.axis("equal")
plt.grid(True)
plt.legend()
plt.show()
return routes, total_distance, per_vehicle_distance, per_vehicle_load
routes, total_dist, per_v_dist, per_v_load = plot_cvrp_routes_ortools(
df, manager, routing, solution,
distance_matrix=data['distance_matrix_scaled'],
scale=1,
show_labels=True,
show_arrows=True
)
print("Total distance:", total_dist)
print("Per-vehicle:", per_v_dist)
Scenario 3 – Vehicle Routing with Time Windows¶
In this scenario, a fleet of $K$ vehicles is available at a central depot. Each vehicle has a fixed load capacity, and customer demands must be delivered without exceeding vehicle capacity constraints. Each customer must also be visited within a specified time window.
The objective is to design a set of vehicle routes that minimizes total travel cost (or distance) while ensuring that all customer demands are satisfied and all operational constraints are respected.
- $K$ vehicles starting and ending at a central depot
- Capacity constraints: total demand served on each route must not exceed vehicle capacity
- Time window constraints on customer service times
## Data Model
K = 4
vehicle_capacity = int(1.2*np.ceil(sum(df['DEMAND']) / K))
print(vehicle_capacity)
data = {}
data['distance_matrix_scaled'] = scaled_dist_matrix
data['num_vehicles'] = K
data['demands'] = df['DEMAND'].tolist()
data['vehicle_capacities'] = [vehicle_capacity]*K
data['time_windows'] = list(zip(df['READY_TIME_1'], df['DUE_TIME_1']))
data['depot'] = depot_index
105
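In OR-Tools the time-window logic is expressed as an additional routing dimension (via a travel-time callback); independent of the solver, the core feasibility rule is that arrival times propagate as $T_j = \max(a_j, T_i + s_i + t_{ij})$ (vehicles may wait for a window to open) and every $T_j \le b_j$. A minimal pure-Python check of a single route, with travel time taken equal to distance as in the Solomon instances (`route_is_time_feasible` is an illustrative helper):

```python
def route_is_time_feasible(route, travel_time, windows, service=None):
    """Check time-window feasibility of one route (list of node indices).
    Arriving before a_i means waiting; arriving after b_i is infeasible."""
    service = service or {}
    t = windows[route[0]][0]  # leave the depot when its window opens
    for i, j in zip(route[:-1], route[1:]):
        t = t + service.get(i, 0) + travel_time[i][j]
        a_j, b_j = windows[j]
        if t > b_j:
            return False
        t = max(t, a_j)  # wait until the window opens
    return True

tt = [[0, 10, 20], [10, 0, 12], [20, 12, 0]]
windows = [(0, 100), (15, 40), (30, 60)]  # node 0 is the depot
print(route_is_time_feasible([0, 1, 2, 0], tt, windows))  # True
print(route_is_time_feasible([0, 2, 1, 0], tt, windows))  # arrive at 2 by 20, wait to 30; 30+12=42 > 40 -> False
```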
IV – Cancer Gene Expression Feature Selection – Genetic Algorithm Case Study¶
Genetic algorithms (GAs) are population-based evolutionary optimization methods inspired by the principles of natural selection. They iteratively evolve a set of candidate solutions through selection, crossover, and mutation, enabling effective exploration of large and complex search spaces. GAs are particularly well suited for non-convex, discrete, and combinatorial optimization problems, where the objective function is non-differentiable and characterized by many local optima. In this case study, we apply a GA to the problem of gene selection from high-dimensional cancer gene expression data, with the goal of identifying a compact subset of genes that maximizes predictive performance.
Gene expression datasets are a canonical example of a challenging optimization problem in biomedical machine learning. They typically contain thousands to tens of thousands of gene expression measurements for only a few hundred or thousand samples, resulting in an extreme high-dimensional, low-sample-size regime. Each sample corresponds to a patient, and the target label represents a clinical outcome such as cancer subtype or disease status. Predictive signal in such data rarely resides in individual genes; instead, it emerges from complex, nonlinear interactions among small groups of genes, often reflecting underlying biological pathways.
Genetic Algorithm Formulation¶
Variables: Each candidate solution is represented as a binary chromosome of length $d$, where $d$ is the number of genes. A value of 1 indicates that the corresponding gene is included in the feature subset, while 0 indicates exclusion.
Objective: Maximize classifier accuracy on held-out validation folds. In our GA, the fitness of a mask is the cross-validated accuracy of a logistic regression model trained on the selected features.
Constraints: To encourage interpretability and biological plausibility, we optionally impose a maximum number of selected genes (e.g., 30–50). The GA naturally handles this discrete constraint without requiring relaxation or approximation.
GA Steps: An initial population of random masks is generated. In each generation, masks are evaluated (fitness), then the best are selected for reproduction. New offspring are created via crossover (combining bits from two parents) and mutation (flipping bits) to introduce diversity. Elite individuals may be carried over. This process repeats for many generations, gradually improving the feature subset.
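The loop above can be sketched end-to-end with NumPy on a synthetic problem. Here `fitness` is a toy stand-in for the cross-validated score used in the case study, and all names and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_true = 30, 5
informative = rng.choice(d, size=k_true, replace=False)  # the "true" genes

def fitness(mask):
    """Toy stand-in for CV accuracy: reward covering informative genes,
    penalize subset size (sparsity pressure)."""
    return mask[informative].sum() - 0.05 * mask.sum()

def evolve(pop_size=40, generations=60, p_mut=0.02, elite=4):
    pop = rng.random((pop_size, d)) < 0.2              # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        pop = pop[np.argsort(scores)[::-1]]            # rank by fitness
        next_pop = [pop[i].copy() for i in range(elite)]  # elitism
        while len(next_pop) < pop_size:
            i, j = rng.integers(0, pop_size // 2, size=2)  # select top half
            cut = rng.integers(1, d)
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])  # crossover
            child ^= rng.random(d) < p_mut                        # mutation
            next_pop.append(child)
        pop = np.array(next_pop)
    best = max(pop, key=fitness)
    return best, fitness(best)

best, score = evolve()
print(score)
```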
Implementation¶
We use the open-source Python library sklearn-genetic-opt (which is built on DEAP) to handle the GA mechanics. This library implements GAFeatureSelectionCV, which wraps scikit-learn estimators in a GA that optimizes cross-validation score while minimizing feature count.
Why GA for feature selection?¶
GA is particularly well suited for this problem because feature selection is a non-convex, combinatorial optimization task with an exponentially large search space ($2^d$ possible subsets). Ensemble-based permutation feature importance methods (e.g., random permutation in random forests or gradient-boosted trees) evaluate features largely in isolation or under conditional perturbations, implicitly assuming that the contribution of each feature can be separated from the rest. This assumption fails in the presence of feature interactions, redundancy, and multicollinearity, where the predictive power emerges only from specific combinations of features rather than individual ones.
More importantly, permutation importance is diagnostic rather than optimization-driven: it explains a trained ensemble model but does not directly solve a well-defined optimization problem. Selecting features based on importance thresholds is heuristic and does not guarantee near-optimal performance for a downstream model. In contrast, a GA directly optimizes the target objective (e.g., cross-validated accuracy with a sparsity constraint) by evaluating entire feature subsets at once. By maintaining a population of candidate solutions and using crossover and mutation, GAs explore multiple regions of the non-convex landscape simultaneously and are far less prone to getting trapped in poor local optima. From an optimization standpoint, GAs are explicitly designed for discrete, non-differentiable, and non-convex problems, making them a principled and academically appropriate choice when the goal is global subset discovery rather than feature ranking.
#!pip install sklearn-genetic-opt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.compose import ColumnTransformer
SEED = 42
np.random.seed(SEED)
df = pd.read_csv('data/brca_data_w_subtypes/brca_data_w_subtypes.csv')
print("Dataset shape:", df.shape)
df.head()
target_col = "vital.status"
X = df.drop(columns=[target_col])
y = df[target_col]
print("Unique values:", y.unique())
print("Value counts:\n", y.value_counts())
feature_names = X.columns.tolist()
Dataset shape: (705, 1941) Unique values: [0 1] Value counts: vital.status 0 611 1 94 Name: count, dtype: int64
# Train / test split
# Larger test split to stress generalization (common in genomics)
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.8,
stratify=y,
random_state=42
)
print(f"Samples: {X.shape[0]}")
print(f"Genes: {X.shape[1]}")
print(f"Train samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
Samples: 705 Genes: 1940 Train samples: 141 Test samples: 564
Just ~700 samples and ~2000 features!¶
num_cols = X_train.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns
print(f"Numeric features: {len(num_cols)}")
print(f"Categorical features: {len(cat_cols)}")
Numeric features: 1936 Categorical features: 4
# preprocessing
preprocess_only = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
],
remainder="drop"
)
# Fit transform to get numeric matrix
X_train_enc = preprocess_only.fit_transform(X_train)
X_test_enc = preprocess_only.transform(X_test)
# Get encoded feature names (for reporting)
feat_names = preprocess_only.get_feature_names_out()
# Now GA runs on numeric encoded matrix
lr = LogisticRegression( max_iter=10000,
solver="saga",
class_weight="balanced",
n_jobs=-1,
random_state=SEED)
print(f"Total features: {len(feat_names)}")
Total features: 1953
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
# Helper function: train/eval logistic regression on a given set of columns
def eval_logreg(col_idx, label, model=lr, scoring_cv="f1"):
"""
col_idx: list/array of integer feature indices OR boolean mask
"""
m = clone(model)
Xtr = X_train_enc[:, col_idx]
Xte = X_test_enc[:, col_idx]
# Fit
m.fit(Xtr, y_train)
# Test metrics
y_pred = m.predict(Xte)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
# CV on training
cv_score = cross_val_score(m, Xtr, y_train, cv=cv, scoring=scoring_cv).mean()
return {
"Approach": label,
"NumFeatures": Xtr.shape[1],
f"CV_{scoring_cv}_mean(train)": cv_score,
"Test_Accuracy": test_acc,
"Test_F1": test_f1,
}
results = []
## Baseline - all features
all_idx = np.arange(X_train_enc.shape[1])
results.append(eval_logreg(all_idx, "All features (LogReg)"))
results
[{'Approach': 'All features (LogReg)',
'NumFeatures': 1953,
'CV_f1_mean(train)': 0.22952958152958153,
'Test_Accuracy': 0.7890070921985816,
'Test_F1': 0.2874251497005988}]
The genetic algorithm maintains a population of candidate feature subsets and iteratively evolves them through selection, crossover, and mutation. Population size determines the diversity of candidate solutions explored in each generation, the number of generations controls the depth of evolutionary search, and elitism ensures that high-performing feature subsets are preserved across generations. A sparsity constraint limits the maximum number of selected features, guiding the search toward compact and generalizable solutions. Together, these parameters balance exploration and exploitation in a highly non-convex combinatorial search space.
ga_selector = GAFeatureSelectionCV(
estimator=lr, # LogisticRegression(...)
cv=cv,
scoring="average_precision",
population_size=100,
generations=100,
keep_top_k=5,
max_features=1000,
verbose=False,
n_jobs=1,
)
ga_selector.fit(X_train_enc, y_train)
ga_mask = np.array(ga_selector.best_features_, dtype=bool)
ga_idx = np.where(ga_mask)[0]
ga_cols = list(np.array(feat_names)[ga_idx])
print(f"Selected {len(ga_cols)} encoded features")
print(ga_cols[:20])
results.append(eval_logreg(ga_idx, f"GA-selected (n={len(ga_idx)})", scoring_cv="f1"))
results
[{'Approach': 'All features (LogReg)',
'NumFeatures': 1955,
'CV_f1_mean(train)': 0.27673350041771094,
'Test_Accuracy': 0.8632075471698113,
'Test_F1': 0.32558139534883723},
{'Approach': 'GA-selected (n=48)',
'NumFeatures': 48,
'CV_f1_mean(train)': 0.2914684603471789,
'Test_Accuracy': 0.8584905660377359,
'Test_F1': 0.21052631578947367},
{'Approach': 'GA-selected (n=88)',
'NumFeatures': 88,
'CV_f1_mean(train)': 0.40913804713804713,
'Test_Accuracy': 0.839622641509434,
'Test_F1': 0.2608695652173913}]
V. Tail-Risk Portfolio Optimization.¶
This case study shows how to build a tail-risk-aware portfolio using Conditional Value-at-Risk (CVaR) (also called Expected Shortfall).
Compared to mean–variance optimization, CVaR focuses explicitly on rare but severe losses, which is often what matters in practice.
We will:
- Pull daily price data for a basket of liquid ETFs (real market data).
- Convert prices to return scenarios.
- Solve a convex linear program that minimizes CVaR at confidence level $\alpha$ (e.g. 95%).
- Backtest out-of-sample and compare against a simple baseline (equal-weight).
Why CVaR?¶
- VaR at level $\alpha$ is the loss threshold exceeded only a fraction $(1-\alpha)$ of the time.
- CVaR at level $\alpha$ is the average loss conditional on exceeding VaR.
In other words: “how bad are things in the worst tail?”
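These tail statistics have simple empirical estimators on historical scenarios; `var_cvar` below is an illustrative helper (losses are positive numbers, e.g. negated returns), separate from the optimization itself:

```python
import numpy as np

def var_cvar(losses, alpha=0.95):
    """Empirical VaR and CVaR of a sample of losses at confidence level alpha."""
    losses = np.sort(np.asarray(losses, dtype=float))
    var = np.quantile(losses, alpha)       # tail threshold
    tail = losses[losses >= var]
    return var, tail.mean()                # CVaR = mean loss in the worst tail

losses = np.array([-0.5, 0.1, 0.2, 0.3, 0.4, 0.8, 1.0, 1.5, 2.0, 5.0])
v, cv = var_cvar(losses, alpha=0.9)
print(v, cv)  # 2.3 5.0
```

By construction CVaR $\ge$ VaR: the average of the worst tail is at least the threshold that defines it.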
Rockafellar & Uryasev showed CVaR minimization can be written as a linear program using auxiliary variables, making it scalable and globally solvable.
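Concretely, with scenario returns $r_s \in \mathbb{R}^n$ for $s = 1,\dots,S$, weights $w$, and auxiliary variables $t$ (the VaR level) and $u_s$ (tail excesses), the Rockafellar–Uryasev program reads:

```latex
\begin{aligned}
\min_{w,\,t,\,u} \quad & t + \frac{1}{(1-\alpha)S} \sum_{s=1}^{S} u_s \\
\text{s.t.} \quad & u_s \ \ge\ -r_s^\top w - t, \qquad u_s \ \ge\ 0, \quad s = 1,\dots,S,\\
& \mathbf{1}^\top w = 1, \qquad w \ \ge\ 0.
\end{aligned}
```

Under mild conditions, at the optimum $t^\star$ equals the VaR of the loss $-r^\top w^\star$ and the objective value equals its CVaR.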
# (Optional) Install dependencies if needed
# !pip install -q yfinance pandas_datareader cvxpy
# If your environment lacks an LP solver, you can also install one of:
# !pip install -q ecos
# !pip install -q osqp
# !pip install -q clarabel
# Install and import necessary libraries
#!pip install yfinance cvxpy pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import cvxpy as cp
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_datareader import data as pdr
from pandas_datareader.stooq import StooqDailyReader
Data acquisition¶
We’ll use a handful of liquid ETFs spanning broad equities, bonds, commodities, and defensive exposures.
You can change the universe based on your use-case (e.g., region-specific, factor ETFs, credit, etc.).
tickers_dict = {
# Broad Equity
"SPY.US": "U.S. broad-market equity (S&P 500)",
"VTI.US": "Total U.S. stock market",
# Big Tech / Mega-Cap Equities
"AAPL.US": "Apple Inc.",
"MSFT.US": "Microsoft Corp.",
"GOOGL.US": "Alphabet Inc.",
"AMZN.US": "Amazon.com Inc.",
"TSLA.US": "Tesla Inc.",
"NVDA.US": "NVIDIA Corp.",
"META.US": "Meta Platforms Inc.",
"JPM.US": "JPMorgan Chase & Co.",
"JNJ.US": "Johnson & Johnson",
"V.US": "Visa Inc.",
"WMT.US": "Walmart Inc.",
"PG.US": "Procter & Gamble Co.",
"UNH.US": "UnitedHealth Group Inc.",
"HD.US": "Home Depot Inc.",
"MA.US": "Mastercard Inc.",
"DIS.US": "The Walt Disney Co.",
"BAC.US": "Bank of America Corp.",
"XOM.US": "Exxon Mobil Corp.",
"PFE.US": "Pfizer Inc.",
"KO.US": "Coca-Cola Co.",
"VZ.US": "Verizon Communications Inc.",
"ADBE.US": "Adobe Inc.",
"CMCSA.US": "Comcast Corp.",
"NFLX.US": "Netflix Inc.",
"T.US": "AT&T Inc.",
"CSCO.US": "Cisco Systems Inc.",
"PEP.US": "PepsiCo Inc.",
"INTC.US": "Intel Corp.",
"CRM.US": "Salesforce Inc.",
"ABNB.US": "Airbnb Inc.",
"PYPL.US": "PayPal Holdings Inc.",
"AVGO.US": "Broadcom Inc.",
"COST.US": "Costco Wholesale Corp.",
"QCOM.US": "Qualcomm Inc.",
"TXN.US": "Texas Instruments Inc.",
"AMD.US": "Advanced Micro Devices Inc.",
"ORCL.US": "Oracle Corp.",
"NKE.US": "Nike Inc.",
"SBUX.US": "Starbucks Corp.",
"CVX.US": "Chevron Corp.",
"BMY.US": "Bristol-Myers Squibb Co.",
"MDLZ.US": "Mondelez International Inc.",
# Sector-Specific Equities
"XLK.US": "Technology sector",
"XLF.US": "Financials sector",
"XLE.US": "Energy sector",
"XLV.US": "Healthcare sector",
"XLY.US": "Consumer Discretionary sector",
"XLI.US": "Industrials sector",
"XLRE.US": "Real Estate sector",
# Size / Style
"IWM.US": "Russell 2000 ETF",
"QQQ.US": "NASDAQ 100 ETF (tech-heavy)",
# International Equity
"EFA.US": "Developed markets (ex-US)",
"EEM.US": "Emerging markets",
"VXUS.US": "Total International equity",
# Government Bonds
"IEF.US": "U.S. Treasuries 7–10Y",
"TLT.US": "U.S. Treasuries 20+Y",
"SHY.US": "Short-term Treasuries",
# Corporate / High Yield
"LQD.US": "Investment-grade corporate bonds",
"HYG.US": "High-yield corporate bonds",
# Commodities
"GLD.US": "Gold",
"SLV.US": "Silver",
"DBC.US": "Broad commodities",
"USO.US": "Crude Oil",
"UNG.US": "Natural Gas",
# Alternative / Real Assets
"VNQ.US": "U.S. Real Estate (REITs)",
"ICLN.US": "Global Clean Energy"
# Cryptocurrency
#"BTC-USD": "Bitcoin (USD-denominated)"
}
tickers = list(tickers_dict.keys())
print(len(tickers))
69
# For a quick test, you can shrink the universe:
#tickers = ["SPY.US", "AAPL.US", "MSFT.US"]
# Define asset tickers and download data
start_date = "2023-01-01"
end_date = "2025-12-24"
prices = pd.DataFrame()
for ticker in tickers:
print(f"Downloading data for {ticker}...")
try:
#data = StooqDailyReader(ticker, start=start_date, end=end_date).read()
#prices[ticker] = data["Close"]
data = pdr.DataReader(ticker, "stooq", start_date, end_date)
prices[ticker] = data["Close"]
except Exception as e:
print(f"Failed to download data for {ticker}: {e}")
prices = prices.sort_index()
prices.tail()
Downloading data for SPY.US... Downloading data for VTI.US... (one progress line per ticker) ... Failed to download data for BRK.B.US: 'Close' ... Downloading data for ICLN.US...
| SPY.US | VTI.US | AAPL.US | MSFT.US | GOOGL.US | AMZN.US | TSLA.US | NVDA.US | META.US | JPM.US | ... | SHY.US | LQD.US | HYG.US | GLD.US | SLV.US | DBC.US | USO.US | UNG.US | VNQ.US | ICLN.US | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2025-12-18 | 676.47 | 333.25 | 272.19 | 483.98 | 302.46 | 226.76 | 483.37 | 174.14 | 664.45 | 313.00 | ... | 83.03 | 110.85 | 80.78 | 398.57 | 59.32 | 22.69 | 67.19 | 12.03 | 88.93 | 16.18 |
| 2025-12-19 | 680.59 | 336.22 | 273.67 | 485.92 | 307.16 | 227.35 | 481.20 | 180.99 | 658.77 | 317.21 | ... | 82.76 | 110.13 | 80.36 | 399.02 | 60.93 | 22.85 | 68.03 | 12.19 | 88.59 | 16.44 |
| 2025-12-22 | 684.83 | 337.60 | 270.97 | 484.92 | 309.78 | 228.43 | 488.73 | 183.69 | 661.50 | 323.09 | ... | 82.72 | 110.11 | 80.43 | 408.23 | 62.47 | 22.39 | 69.73 | 11.85 | 88.24 | 16.65 |
| 2025-12-23 | 687.96 | 338.72 | 272.36 | 486.85 | 314.35 | 232.14 | 485.56 | 189.21 | 664.94 | 325.93 | ... | 82.68 | 110.22 | 80.49 | 413.64 | 64.84 | 22.64 | 70.30 | 12.90 | 88.18 | 16.54 |
| 2025-12-24 | 690.38 | 339.88 | 273.81 | 488.02 | 314.09 | 232.38 | 485.40 | 188.61 | 667.55 | 329.17 | ... | 82.73 | 110.65 | 80.64 | 411.93 | 65.22 | 22.63 | 70.20 | 12.39 | 88.76 | 16.58 |
5 rows × 68 columns
Convert to returns and create scenario matrix¶
Let $r_{t}\in\mathbb{R}^n$ be the vector of asset returns at day $t$.
We will treat historical daily returns as scenarios for the CVaR optimization.
We then split into:
- Train window: used to fit the portfolio weights.
- Test window: used to evaluate out-of-sample.
# Daily simple returns
returns = prices.pct_change(fill_method=None).dropna()
# Train/test split (e.g. 70/30)
split = int(0.7 * len(returns))
R_train = returns.iloc[:split].copy()
R_test = returns.iloc[split:].copy()
print("Train days:", len(R_train), "Test days:", len(R_test))
returns.describe()
Train days: 522 Test days: 225
| SPY.US | VTI.US | AAPL.US | MSFT.US | GOOGL.US | AMZN.US | TSLA.US | NVDA.US | META.US | JPM.US | ... | SHY.US | LQD.US | HYG.US | GLD.US | SLV.US | DBC.US | USO.US | UNG.US | VNQ.US | ICLN.US | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | ... | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 | 747.000000 |
| mean | 0.000885 | 0.000824 | 0.001194 | 0.001081 | 0.001875 | 0.001537 | 0.002721 | 0.003953 | 0.002528 | 0.001299 | ... | 0.000025 | 0.000068 | 0.000127 | 0.001229 | 0.001614 | -0.000035 | 0.000224 | -0.001209 | 0.000161 | -0.000111 |
| std | 0.009688 | 0.009829 | 0.016162 | 0.014655 | 0.019125 | 0.020169 | 0.037896 | 0.031712 | 0.024087 | 0.014613 | ... | 0.001440 | 0.005060 | 0.003890 | 0.010174 | 0.017981 | 0.009729 | 0.018637 | 0.037059 | 0.011284 | 0.015196 |
| min | -0.058543 | -0.058683 | -0.092456 | -0.061809 | -0.095085 | -0.089791 | -0.154262 | -0.169682 | -0.113348 | -0.080502 | ... | -0.005380 | -0.019489 | -0.016328 | -0.064269 | -0.082355 | -0.049955 | -0.080727 | -0.152493 | -0.043399 | -0.073333 |
| 25% | -0.003385 | -0.003649 | -0.006667 | -0.006762 | -0.008756 | -0.009610 | -0.018546 | -0.013673 | -0.009695 | -0.005433 | ... | -0.000615 | -0.002874 | -0.001874 | -0.004473 | -0.008661 | -0.005594 | -0.011671 | -0.027269 | -0.006174 | -0.009618 |
| 50% | 0.001092 | 0.001029 | 0.001498 | 0.001270 | 0.002540 | 0.000909 | 0.001628 | 0.003672 | 0.001393 | 0.001856 | ... | 0.000121 | 0.000091 | 0.000250 | 0.001122 | 0.001030 | 0.000444 | 0.000393 | -0.001577 | 0.000204 | -0.000626 |
| 75% | 0.005975 | 0.006249 | 0.008772 | 0.009165 | 0.011635 | 0.013333 | 0.022602 | 0.021652 | 0.013098 | 0.008802 | ... | 0.000731 | 0.003190 | 0.002280 | 0.007078 | 0.011803 | 0.005966 | 0.012552 | 0.023378 | 0.006585 | 0.009261 |
| max | 0.105019 | 0.101456 | 0.153288 | 0.101337 | 0.102241 | 0.119770 | 0.226900 | 0.243701 | 0.232824 | 0.115445 | ... | 0.009974 | 0.016769 | 0.026799 | 0.036991 | 0.063932 | 0.036210 | 0.068887 | 0.144036 | 0.059657 | 0.075327 |
8 rows × 68 columns
CVaR¶
Let the portfolio return in scenario $j$ be $R_j = w^\top r_j,$ and define the loss $L_j = -R_j = -w^\top r_j.$
- Risk management is fundamentally about controlling large losses, not just average behavior.
- Variance-based risk treats upside and downside symmetrically and does not distinguish between frequent small losses and rare catastrophic ones.
- CVaR (Conditional Value-at-Risk) directly targets the worst-case tail of the loss distribution.
Value-at-Risk (VaR)¶
For a confidence level $\alpha \in (0,1)$, the Value-at-Risk is defined as the $\alpha$-quantile of the loss distribution:
$$\text{VaR}_\alpha(L) = \inf\{\gamma : \mathbb{P}(L \le \gamma) \ge \alpha\}.$$
Interpretation:
- With probability at least $\alpha$, the loss will not exceed $\text{VaR}_\alpha(L)$.
- VaR tells us how bad losses can get, but not how bad they are beyond that threshold.
VaR is non-convex and difficult to optimize directly.
Conditional Value-at-Risk (CVaR)¶
CVaR addresses this limitation by measuring the expected loss conditional on exceeding VaR:
$$ \text{CVaR}_\alpha(L) = \mathbb{E}[L \mid L \ge \text{VaR}_\alpha(L)]. $$
Intuition:
- CVaR is the average of the worst $(1-\alpha)$ fraction of losses
- It explicitly penalizes tail severity, not just tail location
Importantly, CVaR is convex, which makes it tractable to optimize.
CVaR can be written as:
$$ \text{CVaR}_\alpha(L) = \min_{\gamma} \left[ \gamma + \frac{1}{1-\alpha} \mathbb{E}\big[(L - \gamma)_+\big] \right], $$ where $$ (x)_+ = \max(x, 0). $$
Interpretation:
- $\gamma$ acts as a candidate VaR level
- $(L - \gamma)_+$ measures how much losses exceed $\gamma$
- CVaR balances:
- Choosing $\gamma$ high enough to cover most losses
- Penalizing excess losses beyond $\gamma$
In practice, the loss distribution is unknown. We use historical or simulated data: $\{L_1, L_2, \dots, L_N\}.$ The expectation becomes a sample average:
$$ \text{CVaR}_\alpha(L) \approx \min_{\gamma} \left[ \gamma + \frac{1}{(1-\alpha)N} \sum_{j=1}^N (L_j - \gamma)_+ \right]. $$
The function $(L_j - \gamma)_+$ is convex but non-smooth. We linearize it using auxiliary variables $z_j$: $z_j \ge 0, \qquad z_j \ge L_j - \gamma.$ At the optimum: $z_j = \max(L_j - \gamma, 0)$. Substituting into the objective yields:
$$ \min_{w,\gamma,z} \quad \gamma + \frac{1}{(1-\alpha)N}\sum_{j=1}^N z_j. $$
- $\gamma$ controls where the tail begins (VaR estimate)
- $z_j$ penalizes only the scenarios where losses exceed $\gamma$
- The factor $\frac{1}{1-\alpha}$ rescales the tail average
- Only the worst $(1-\alpha)$ fraction of scenarios contribute meaningfully
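As a quick sanity check on the sample-average formula, the sketch below (synthetic loss scenarios, all values hypothetical) compares the Rockafellar–Uryasev minimization over $\gamma$ against the direct tail average:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=5000)   # synthetic loss scenarios
alpha = 0.95

# Direct estimate: average of the worst (1 - alpha) fraction of losses
var = np.quantile(L, alpha)
cvar_tail = L[L >= var].mean()

# Rockafellar-Uryasev: min over gamma of  gamma + E[(L - gamma)_+] / (1 - alpha)
gammas = np.linspace(L.min(), L.max(), 400)
vals = gammas + np.maximum(L[None, :] - gammas[:, None], 0).mean(axis=1) / (1 - alpha)
cvar_ru = vals.min()

print(round(cvar_tail, 3), round(cvar_ru, 3))  # the two estimates agree closely
```

The grid minimizer of the Rockafellar–Uryasev objective also recovers the VaR level, which is exactly the role $\gamma$ plays in the LP below.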
This formulation:
- Ignores upside returns
- Ignores moderate losses
- Focuses optimization power on catastrophic events
Putting everything together:
$$ \begin{aligned} \min_{w,\gamma,z}\quad & \gamma + \frac{1}{(1-\alpha)N}\sum_{j=1}^N z_j \\ \text{s.t.}\quad & z_j \ge 0, \\ & z_j \ge -w^\top r_j - \gamma, \quad j=1,\dots,N, \\ & \sum_{i=1}^n w_i = 1, \\ & w_i \ge 0. \end{aligned} $$
This is a linear program:
- Linear objective
- Linear constraints
- Global optimum guaranteed
Pure CVaR minimization may lead to overly conservative portfolios. To enforce a minimum expected return:
$$ \mathbb{E}[w^\top r] \ge \rho_{\text{target}}. $$
Using the sample mean:
$$ \frac{1}{N}\sum_{j=1}^N w^\top r_j \ge \rho_{\text{target}}. $$
This adds a linear constraint and preserves convexity.
- CVaR minimizes how bad things get when things go really bad
- The optimization explicitly models tail losses
- Auxiliary variables isolate extreme scenarios
- Convexity ensures reliability, scalability, and robustness
This makes CVaR minimization a modern risk optimization tool used in asset management, insurance, and energy markets.
CVaR minimization as a convex optimization problem¶
We define portfolio weights $w \in \mathbb{R}^n$ with constraints:
- Fully invested: $\sum_i w_i = 1$
- Long-only (typical for many ETF mandates): $w_i \ge 0$
Portfolio return on scenario $j$ is $w^\top r_j$ and portfolio loss is $L_j = -w^\top r_j$.
For confidence level $\alpha \in (0,1)$, the CVaR minimization can be written as:
$$ \min_{w,\gamma,z}\; \gamma + \frac{1}{(1-\alpha)N}\sum_{j=1}^N z_j $$ subject to $$ z_j \ge 0,\qquad z_j \ge L_j - \gamma, $$ plus portfolio constraints.
This is a linear program (LP) because everything is linear in $(w,\gamma,z)$.
In practice, you often want “min CVaR subject to at least some expected return.” We can add:
$$ \mathbb{E}[w^\top r] \ge \rho_{\text{target}} $$
using the sample mean return as an estimate.
R_train.isnull().sum().sum()
0
alpha = 0.9 # tail confidence level
N, n = R_train.shape
# Scenario loss matrix (N x n) for CVaR constraints
# L_j = - w^T r_j
R = R_train.values
# Decision variables
w = cp.Variable(n)
gamma = cp.Variable() # VaR-like threshold
z = cp.Variable(N) # tail slack variables
# Optional: minimum expected daily return constraint
mu = R_train.mean().values
# --- Convert annualized expected return to daily ---
# If returns are arithmetic daily returns, a common conversion is:
# (1 + r_annual)^(1/252) - 1
r_target_annual = 0.10
trading_days = 252
rho_target = (1 + r_target_annual)**(1 / trading_days) - 1
#rho_target = 10/(100*252) # a mild target, adaptive to data scale
print(f"Using rho_target = {rho_target:.6f} (~{100*r_target_annual}% annualized approx)")
# You can also set something like rho_target = 0.0002 (~5% annualized approx) as a fixed target.
constraints = [
cp.sum(w) == 1,
w >= 0,
z >= 0,
z >= -(R @ w) - gamma, # z_j >= L_j - gamma
mu @ w >= rho_target # expected return constraint (optional but useful)
]
objective = cp.Minimize(gamma + (1 / ((1 - alpha) * N)) * cp.sum(z))
prob = cp.Problem(objective, constraints)
# Solve
result = prob.solve(solver=cp.ECOS, verbose=False)
print("Status:", prob.status)
print("Optimal objective (CVaR proxy):", result)
Using rho_target = 0.000378 (~10.0% annualized approx)
Status: optimal
Optimal objective (CVaR proxy): 0.002875933039936678
weights_cvar = pd.Series(w.value, index=R_train.columns).sort_values(ascending=False)
weights_cvar
SHY.US 7.991167e-01
WMT.US 4.165347e-02
META.US 3.078826e-02
JPM.US 3.008830e-02
V.US 2.925812e-02
...
AMD.US 1.997300e-16
ADBE.US 1.490573e-16
UNG.US 1.171536e-16
INTC.US 8.628544e-17
ICLN.US -5.853315e-18
Length: 68, dtype: float64
Baseline: Mean–Variance Portfolio Optimization from Case Study II¶
Compared to Case Study II, we use the minimum-variance formulation with a return constraint rather than a weighted mean–variance objective.
# --- Estimates from training data (daily returns) ---
mu = R_train.mean().values # daily mean returns (n,)
Sigma = R_train.cov().values # daily covariance (n,n)
n = len(returns.columns)
w = cp.Variable(n)
# --- Min-variance portfolio with target return constraint ---
objective = cp.Minimize(cp.quad_form(w, Sigma))
constraints = [
cp.sum(w) == 1, # fully invested
w >= 0, # long-only
w <= 0.25, # max 25% per asset
mu @ w >= rho_target # target expected return (daily)
]
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.OSQP) # OSQP is a good default for QPs
print("Status:", prob.status)
print("Target daily return:", rho_target)
weights_baseline = w.value
weights_b = pd.Series(weights_baseline, index=R_train.columns).sort_values(ascending=False)
weights_b
Status: optimal
Target daily return: 0.0003782865315342665
HYG.US 2.500000e-01
SHY.US 2.500000e-01
IEF.US 1.593812e-01
WMT.US 7.535102e-02
GLD.US 4.291713e-02
...
AAPL.US -8.946613e-18
NFLX.US -9.181773e-18
HD.US -1.484294e-17
LQD.US -3.260698e-17
TLT.US -5.775274e-17
Length: 68, dtype: float64
Out-of-sample backtest¶
We compute:
- Daily portfolio returns on the test set
- Cumulative returns
- Realized VaR/CVaR estimates from test returns
Important: CVaR is a tail quantity; to estimate it reliably you want enough test observations.
If your test window is short, interpret tail metrics with care.
def portfolio_returns(R_df: pd.DataFrame, w: pd.Series) -> pd.Series:
w = w.reindex(R_df.columns).fillna(0.0)
return (R_df * w.values).sum(axis=1)
# Test portfolio returns
p_cvar_train = portfolio_returns(R_train, weights_cvar)
p_b_train = portfolio_returns(R_train, weights_b)
p_cvar = portfolio_returns(R_test, weights_cvar)
p_b = portfolio_returns(R_test, weights_b)
def var_cvar(x: pd.Series, alpha=0.95):
# x is returns; losses are -x
losses = -x.values
var = np.quantile(losses, alpha)
cvar = losses[losses >= var].mean()
return var, cvar
var_cvar_cvar = var_cvar(p_cvar, alpha=alpha)
var_cvar_b = var_cvar(p_b, alpha=alpha)
print(f"Test VaR@{alpha:.2f} (CVaR-opt): {var_cvar_cvar[0]:.4f} | Test CVaR@{alpha:.2f}: {var_cvar_cvar[1]:.4f}")
print(f"Test VaR@{alpha:.2f} (Mean-Var): {var_cvar_b[0]:.4f} | Test CVaR@{alpha:.2f}: {var_cvar_b[1]:.4f}")
Test VaR@0.90 (CVaR-opt): 0.0025 | Test CVaR@0.90: 0.0040
Test VaR@0.90 (Mean-Var): 0.0037 | Test CVaR@0.90: 0.0058
# Cumulative returns
cum_cvar = (1 + p_cvar_train).cumprod()
cum_b = (1 + p_b_train).cumprod()
plt.figure(figsize=(10,5))
plt.plot(cum_cvar.index, cum_cvar, label="CVaR-min (train)")
plt.plot(cum_b.index, cum_b, label="Mean-variance (train)")
plt.title("In-sample cumulative growth")
plt.xlabel("Date")
plt.ylabel("Growth of $1")
plt.legend()
plt.grid(True)
plt.show()
# Cumulative returns
cum_cvar = (1 + p_cvar).cumprod()
cum_b = (1 + p_b).cumprod()
plt.figure(figsize=(10,5))
plt.plot(cum_cvar.index, cum_cvar, label="CVaR-min (test)")
plt.plot(cum_b.index, cum_b, label="Mean-variance (test)")
plt.title("Out-of-sample cumulative growth")
plt.xlabel("Date")
plt.ylabel("Growth of $1")
plt.legend()
plt.grid(True)
plt.show()
Tail-loss visualization¶
import numpy as np
import matplotlib.pyplot as plt
def plot_tail_comparison_shaded(
losses_a,
losses_b,
alpha=0.95,
label_a="CVaR-min",
label_b="Mean–Variance",
title="Tail risk comparison (test set)"
):
# Compute VaR / CVaR
var_a = np.quantile(losses_a, alpha)
cvar_a = losses_a[losses_a >= var_a].mean()
var_b = np.quantile(losses_b, alpha)
cvar_b = losses_b[losses_b >= var_b].mean()
# Common bins for fair comparison
xmin = min(losses_a.min(), losses_b.min())
xmax = max(losses_a.max(), losses_b.max())
bins = np.linspace(xmin, xmax, 80)
plt.figure(figsize=(11, 5))
# Histograms (density)
plt.hist(
losses_a, bins=bins, density=True, alpha=0.6,
label=label_a
)
plt.hist(
losses_b, bins=bins, density=True, alpha=0.6,
label=label_b
)
# VaR lines
plt.axvline(
var_a, linestyle="--", linewidth=2,
label=f"{label_a} VaR@{alpha:.2f}"
)
plt.axvline(
var_b, linestyle="--", linewidth=2,
label=f"{label_b} VaR@{alpha:.2f}"
)
# Shade tail regions
y_max = plt.ylim()[1]
plt.fill_betweenx(
[0, y_max],
var_a,
xmax,
alpha=0.15
)
plt.fill_betweenx(
[0, y_max],
var_b,
xmax,
alpha=0.15
)
# CVaR annotations (not vertical lines)
plt.text(
cvar_a,
y_max * 0.85,
f"'{label_a}' - Avg Loss in Tail = {100*cvar_a:.2f}%",
rotation=90,
va="top",
ha="right"
)
plt.text(
cvar_b,
y_max * 0.85,
f"'{label_b}' - Avg Loss in Tail = {100*cvar_b:.2f}%",
rotation=90,
va="top",
ha="right"
)
# Labels and styling
plt.title(title)
plt.xlabel("Daily loss")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()
plot_tail_comparison_shaded(
losses_a = -p_cvar_train.values,
losses_b = -p_b_train.values,
alpha = alpha,
title = "Tail risk comparison: CVaR vs Mean–Variance (train)"
)
plot_tail_comparison_shaded(
losses_a = -p_cvar.values,
losses_b = -p_b.values,
alpha = alpha,
title = "Tail risk comparison: CVaR vs Mean–Variance (test)"
)
VI — Matchmaking: An Optimal Transport Case Study¶
Suppose we have two groups of people participating in a speed-dating event.
- For each person we can build a feature vector (age, interests, stated preferences, etc.).
- For any pair (person from group A, person from group B), we can compute a compatibility cost: low cost means "good match".
What we want: a global matching policy that answers:
"How should mass (attention / recommendations) flow from group A to group B to minimize total incompatibility, while treating everyone fairly?"
Optimal Transport is a natural fit because it enforces global consistency:
- Everyone in group A must distribute all their mass.
- Everyone in group B must receive the right total mass.
This prevents a common failure mode of naive scoring systems: everyone gets matched to the same few "top" profiles.
A common baseline is:
- Score every possible pair.
- For each person, recommend the top-k matches by score.
This is not a valid matching policy at the population level:
- It ignores competition (many people choose the same top profiles).
- It ignores scarcity (limited attention / limited capacity).
- It can create extreme exposure imbalance (a few people receive nearly all recommendations).
Optimal Transport fixes this by adding hard constraints (marginals) and solving a global optimization.
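To make the imbalance concrete, here is a small synthetic sketch (all sizes and numbers hypothetical): when pairwise scores share a common per-profile "popularity" component, independent top-k recommendation piles attention onto a few profiles:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, k = 200, 200, 5

# Scores = shared per-profile popularity + individual taste noise
popularity = rng.normal(size=n_b)
scores = popularity[None, :] + 0.5 * rng.normal(size=(n_a, n_b))

# Naive policy: each person in A independently takes their top-k profiles in B
topk = np.argsort(-scores, axis=1)[:, :k]
counts = np.bincount(topk.ravel(), minlength=n_b)

print("total recommendations:", counts.sum())          # n_a * k = 1000
print("most-recommended profile receives:", counts.max())
print("profiles receiving zero recommendations:", (counts == 0).sum())
```

A handful of profiles absorb a large share of the 1000 recommendation slots while many receive none; the OT marginal constraints introduced next rule this out by construction.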
Data source¶
We use the public SpeedDating dataset (participants in speed dating events, 2002–2004) Link
Each row corresponds to an interaction (a date) between two participants and includes:
- participant id (`iid`) and partner id (`pid`)
- participant attributes (e.g. `age`, `field`, ...)
- participant stated preference weights (e.g. how much they value attractiveness, sincerity, ...)
- partner ratings (e.g. `attr_o`, `intel_o`, ...)
- decision / match outcome (e.g. `dec`, `match`)
#!pip install POT
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import StandardScaler
import ot
ot.__version__
'0.9.6.post1'
# Load the Speed Dating dataset
df = pd.read_csv('data/SpeedDating.csv')
df.shape, df.columns[:20]
((8378, 123),
Index(['has_null', 'wave', 'gender', 'age', 'age_o', 'd_age', 'd_d_age',
'race', 'race_o', 'samerace', 'importance_same_race',
'importance_same_religion', 'd_importance_same_race',
'd_importance_same_religion', 'field', 'pref_o_attractive',
'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny',
'pref_o_ambitious'],
dtype='object'))
Build person-level feature vectors¶
The raw table is interaction-level (one row per date). For OT we want one vector per person.
We'll construct a compact numeric feature vector per participant using:
- basic demographics (e.g. age)
- stated preference weights (e.g. how much they care about attractiveness / intelligence / etc.)
We then split into two distributions:
- $X = \{x_i\}$: men
- $Y = \{y_j\}$: women
(The dataset is heterosexual speed dating; we treat gender as the group indicator.)
df['gender'].value_counts(dropna=False).head()
gender
male      4194
female    4184
Name: count, dtype: int64
candidate_cols = [
'gender', #Gender of self
'age', #Age of self
'race', #Race of self
'importance_same_race', #How important is it that partner is of same race?
'importance_same_religion', #How important is it that partner has same religion?
'field', #Field of study
'pref_o_attractive', #How important does partner rate attractiveness
'pref_o_sincere', #How important does partner rate sincerity
'pref_o_intelligence', #How important does partner rate intelligence
'pref_o_funny', #How important does partner rate being funny
'pref_o_ambitious', #How important does partner rate ambition
'pref_o_shared_interests', #How important does partner rate having shared interests
'attractive_important', #What do you look for in a partner - attractiveness
'sincere_important', #What do you look for in a partner - sincerity
'intellicence_important', #What do you look for in a partner - intelligence
'funny_important', #What do you look for in a partner - being funny
'ambtition_important', #What do you look for in a partner - ambition
'shared_interests_important', #What do you look for in a partner - shared interests
'attractive', #Rate yourself - attractiveness
'sincere', #Rate yourself - sincerity
'intelligence', #Rate yourself - intelligence
'funny', #Rate yourself - being funny
'ambition', #Rate yourself - ambition
'sports', #Your own interests [1-10]
'tvsports', #Your own interests [1-10]
'exercise', #Your own interests [1-10]
'dining', #Your own interests [1-10]
'museums', #Your own interests [1-10]
'art', #Your own interests [1-10]
'hiking', #Your own interests [1-10]
'gaming', #Your own interests [1-10]
'clubbing', #Your own interests [1-10]
'reading', #Your own interests [1-10]
'tv', #Your own interests [1-10]
'theater', #Your own interests [1-10]
'movies', #Your own interests [1-10]
'concerts', #Your own interests [1-10]
'music', #Your own interests [1-10]
'shopping', #Your own interests [1-10]
'yoga', #Your own interests [1-10]
'expected_happy_with_sd_people', #How happy do you expect to be with the people you meet during the speed-dating event?
'expected_num_interested_in_me', #Out of the 20 people you will meet, how many do you expect will be interested in dating you?
'expected_num_matches' #How many matches do you expect to get?
]
not_available_cols = [c for c in candidate_cols if c not in df.columns]
available_cols = [c for c in candidate_cols if c in df.columns]
print("Not available columns:", not_available_cols)
Not available columns: []
df_relevant = df[available_cols].copy()
# check for missing values
df_relevant.isnull().sum()
# remove null values
df_relevant = df_relevant.dropna().reset_index(drop=True)
#df_relevant.isnull().sum()
df_relevant.shape
df_relevant
men = df_relevant[df_relevant["gender"] == 'male'].copy()
women = df_relevant[df_relevant["gender"] == 'female'].copy()
men = men.drop(columns=['gender'])
women = women.drop(columns=['gender'])
# preprocess df
def preprocess_mixed_table(df: pd.DataFrame, cols=None) -> pd.DataFrame:
"""
Minimal preprocessing for mixed-type columns.
Rules:
1) Detect column type by inspection.
2) Numeric columns: impute missing with MODE.
3) "Range-of-integers" string columns (e.g. '6-10'): convert to midpoint float,
then impute missing with MEDIAN.
4) Categorical columns: one-hot encode; missing -> all zeros (i.e., no 'missing' dummy).
Returns:
A new DataFrame with:
- numeric columns kept (imputed)
- range columns converted to numeric (imputed)
- categorical columns replaced by one-hot columns
"""
df = df.copy()
if cols is None:
cols = list(df.columns)
else:
cols = [c for c in cols if c in df.columns]
# --- helpers ---
range_pat = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")
def is_range_string_series(s: pd.Series, sample_n=200, min_hits=3) -> bool:
# detect columns that look like "6-10"
x = s.dropna().astype(str)
if len(x) == 0:
return False
x = x.sample(min(sample_n, len(x)), random_state=0)
hits = sum(bool(range_pat.match(v)) for v in x)
return hits >= min_hits
def range_to_midpoint(v):
if pd.isna(v):
return np.nan
if isinstance(v, (int, float, np.number)):
return float(v)
s = str(v).strip()
m = range_pat.match(s)
if m:
lo, hi = float(m.group(1)), float(m.group(2))
return 0.5 * (lo + hi)
# try numeric string
try:
return float(s)
except Exception:
return np.nan
out_parts = []
for c in cols:
s = df[c]
# Numeric dtype → mode impute
if pd.api.types.is_numeric_dtype(s):
mode_vals = s.dropna().mode()
fill = mode_vals.iloc[0] if len(mode_vals) else 0.0
out_parts.append(s.fillna(fill).to_frame(c))
continue
# Object/category: check if it's mostly "a-b" integer range
if is_range_string_series(s):
s_num = s.apply(range_to_midpoint).astype(float)
med = float(np.nanmedian(s_num.values)) if np.isfinite(s_num.values).any() else 0.0
out_parts.append(s_num.fillna(med).to_frame(c))
continue
# Otherwise treat as categorical: one-hot; missing => all zeros
s_cat = s.astype("object")
dummies = pd.get_dummies(s_cat, prefix=c, dummy_na=False).astype(int)
# Ensure missing rows become all-zeros
missing_mask = s_cat.isna()
if missing_mask.any() and dummies.shape[1] > 0:
dummies.loc[missing_mask, :] = 0
out_parts.append(dummies)
return pd.concat(out_parts, axis=1)
df_preprocessed = preprocess_mixed_table(df_relevant)
print(df_preprocessed.shape)
print(df_preprocessed.isna().sum().sort_values(ascending=False).head())
non_numeric = df_preprocessed.select_dtypes(exclude=[np.number]).columns
print(non_numeric)
men = df_preprocessed[df_preprocessed["gender_male"] == 1].copy()
women = df_preprocessed[df_preprocessed["gender_female"] == 1].copy()
men = men.drop(columns=['gender_male', 'gender_female'])
women = women.drop(columns=['gender_male', 'gender_female'])
(1452, 103)
gender_female             0
field_psychology          0
funny_important           0
intellicence_important    0
sincere_important         0
dtype: int64
Index([], dtype='object')
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
men_desc = men.describe().T
women_desc = women.describe().T
# Add top-level column labels
men_desc.columns = pd.MultiIndex.from_product([["Men"], men_desc.columns])
women_desc.columns = pd.MultiIndex.from_product([["Women"], women_desc.columns])
# Concatenate side by side
desc_side_by_side = pd.concat([men_desc, women_desc], axis=1)
desc_side_by_side
import matplotlib.pyplot as plt
features = men_desc.index # or a subset list
men_means = men_desc[("Men", "mean")]
women_means = women_desc[("Women", "mean")]
x = range(len(features))
width = 0.35
plt.figure(figsize=(20, 10))
plt.bar(x, men_means, width, label="Men")
plt.bar([i + width for i in x], women_means, width, label="Women")
plt.xticks([i + width/2 for i in x], features, rotation=90, ha="right")
plt.ylabel("Mean value")
plt.title("Feature-wise mean comparison")
plt.legend()
plt.tight_layout()
plt.show()
X_raw = men.values
Y_raw = women.values
# Standardize features (important for Euclidean costs)
scaler = StandardScaler()
scaler.fit(np.vstack([X_raw, Y_raw]))
X = scaler.transform(X_raw)
Y = scaler.transform(Y_raw)
X.shape, Y.shape
((718, 101), (734, 101))
Formulation¶
We have two populations represented as feature vectors embedded in a shared metric space:
- Men: $X = \{x_i \in \mathbb{R}^d\}_{i=1}^n$
- Women: $Y = \{y_j \in \mathbb{R}^d\}_{j=1}^m$
Each vector is a standardized profile (e.g., age, preferences, interests). Our goal is not to predict individual outcomes, but to compute a globally consistent, population-level matching between the two distributions.
Cost Function¶
We define a pairwise compatibility cost: $$ C_{ij} = \|x_i - y_j\|_2^2 $$
Lower cost indicates higher similarity.
The cost matrix $C$ is fixed and exogenous; defining it is a modeling choice made prior to optimization.
Decision Variable¶
We optimize over a transport plan: $$ \Pi \in \mathbb{R}_+^{n \times m} $$
$\Pi_{ij}$ represents the amount of probability mass assigned between man $i$ and woman $j$.
$\Pi$ is a soft coupling (a joint distribution over pairs), not a one-to-one assignment.
Marginal Constraints (Population Balance)¶
Let $a \in \Delta^n$ and $b \in \Delta^m$ be probability weights (typically uniform): $$ a_i = \tfrac{1}{n}, \quad b_j = \tfrac{1}{m} $$
The transport plan must satisfy: $$ \Pi \mathbf{1}_m = a, \qquad \Pi^\top \mathbf{1}_n = b $$
Thus, each individual contributes and receives the correct total mass in aggregate.
Optimal Transport Objective¶
Unregularized OT (Linear Program): $$ \min_{\Pi \ge 0} \ \langle C, \Pi \rangle \quad \text{s.t.} \quad \Pi \mathbf{1}_m = a,\ \Pi^\top \mathbf{1}_n = b $$
This is a linear program. Its solutions often lie at sparse extreme points and may be non-unique.
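To illustrate the LP structure and the sparsity of extreme-point solutions, here is a minimal sketch on a tiny synthetic instance, using `scipy.optimize.linprog` rather than a dedicated OT solver (all sizes and the random cost matrix are hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

# Tiny unregularized OT instance: 3 sources, 3 targets, uniform marginals
n, m = 3, 3
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
rng = np.random.default_rng(1)
C = rng.random((n, m))

# Flatten Pi row-major and encode the marginal constraints as equalities
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums = a
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums = b

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
Pi = res.x.reshape(n, m)
# Basic (vertex) solutions have at most n + m - 1 nonzero entries
print("nonzeros:", np.count_nonzero(Pi > 1e-9))
```

The solver returns a vertex of the transport polytope, which is why unregularized OT plans tend to be sparse, near-deterministic assignments.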
Entropic Optimal Transport (Regularized, Practical)¶
To obtain a smooth, stable, and computationally efficient solution, we add entropy regularization: $$ \min_{\Pi \ge 0} \ \langle C, \Pi \rangle - \varepsilon H(\Pi), \qquad H(\Pi) = -\sum_{i,j} \Pi_{ij}\log \Pi_{ij} $$
Equivalently: $$ \min_{\Pi \ge 0} \ \langle C, \Pi \rangle + \varepsilon \sum_{i,j} \Pi_{ij}(\log \Pi_{ij}-1) $$
where $\varepsilon > 0$ controls the smoothness of the matching.
This problem is strictly convex (under mild conditions) and admits a unique solution. It is a well-defined convex optimization problem, solved in practice with the Sinkhorn iteration algorithm.
Sinkhorn iterations are the numerical algorithm used to compute the solution of the entropically regularized optimal transport problem.
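For intuition, the Sinkhorn updates can be written in a few lines of NumPy: alternately rescale the kernel \(K = e^{-C/\varepsilon}\) so the plan's row and column marginals match \(a\) and \(b\). This is an illustrative sketch (no log-domain stabilization, fixed iteration count), not the production implementation used by POT; the toy data is made up.

```python
import numpy as np

def sinkhorn_sketch(a, b, C, eps, n_iter=1000):
    """Minimal Sinkhorn iteration for entropic OT (illustrative sketch)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)      # rescale to match column marginals
        u = a / (K @ v)        # rescale to match row marginals
    return u[:, None] * K * v[None, :]   # plan: diag(u) K diag(v)

# Toy check with random synthetic costs
rng = np.random.default_rng(0)
C = rng.random((4, 5))
a = np.full(4, 1 / 4)
b = np.full(5, 1 / 5)
Pi_sketch = sinkhorn_sketch(a, b, C, eps=0.5)
```

After convergence the plan is nonnegative and satisfies both marginal constraints, which is exactly the feasibility condition of the regularized problem above.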
- Decision variable: $\Pi$ (a joint distribution over pairs)
- Objective: minimize total mismatch cost while encouraging smooth matchings
- Constraints: preserve population-level balance
- Problem class: convex optimization
- unregularized OT: linear program
- entropic OT: strictly convex, efficiently solvable
import numpy as np
import ot # POT
n, m = X.shape[0], Y.shape[0]
a = np.ones(n) / n
b = np.ones(m) / m
# Compute the cost matrix C_ij = ||x_i - y_j||^2
# POT provides fast distance computation:
C = ot.dist(X, Y, metric="euclidean") ** 2 # shape (n, m)
# Optional: cost normalization (often improves numerical stability)
C = C / C.max()
# Choose entropy regularization strength (epsilon)
# Larger reg => smoother plan; smaller reg => sharper plan but more numerically sensitive
reg = 0.001
# Solve entropic OT with Sinkhorn
Pi = ot.sinkhorn(a, b, C, reg, numItermax=5000, stopThr=1e-9)
Pi.shape, Pi.min(), Pi.max(), Pi.sum()
((718, 734), 0.0, 0.0013623977870042849, 1.0)
Interpreting OT Results (with Baselines)¶
We compare optimal transport to two baselines:
- random matching, which ignores costs,
- a nearest-neighbor baseline, which matches each man to the most similar woman based on pairwise distance only.
Random matching distributes mass independently of cost, while nearest-neighbor concentrates mass on low-cost pairs but can overuse the same popular profiles and violate global balance. In contrast, optimal transport concentrates mass on low-cost pairs while simultaneously enforcing balanced participation across the entire population.
This highlights the key advantage of OT: it performs global cost minimization under balancing constraints, whereas baselines are either cost-blind (random) or locally optimal but globally inconsistent (nearest-neighbor).
Random transport baseline for comparison¶
def random_transport(a, b, seed=0):
    """Random baseline: spread each man's mass across women at random."""
    rng = np.random.default_rng(seed)
    Pi_rand = rng.random((len(a), len(b)))
    Pi_rand = Pi_rand / Pi_rand.sum(axis=1, keepdims=True)  # normalize each row
    Pi_rand = Pi_rand * a[:, None]                          # scale row i to mass a_i
    return Pi_rand
Pi_rand = random_transport(a, b, seed=0)
Pi_rand.sum()
1.0000000000000002
Nearest-neighbor (greedy) baseline transport¶
def nearest_neighbor_transport(C, a, b=None):
    """
    Nearest-neighbor (greedy) baseline transport.
    For each source i (man), send all its mass a_i to the single closest target j*.
    This uses pairwise costs only and does NOT enforce column marginals.
    Args:
        C: (n, m) cost matrix
        a: (n,) source weights (e.g., uniform 1/n)
        b: unused (kept for interface symmetry)
    Returns:
        Pi_nn: (n, m) transport plan with exactly one nonzero per row.
    """
    n, m = C.shape
    j_star = np.argmin(C, axis=1)          # closest woman for each man
    Pi_nn = np.zeros((n, m), dtype=float)
    Pi_nn[np.arange(n), j_star] = a        # send all mass to nearest neighbor
    return Pi_nn
# Example usage:
Pi_nn = nearest_neighbor_transport(C, a)
Pi_nn.shape, Pi_nn.sum(), Pi_nn.min(), Pi_nn.max()
((718, 734), 0.9999999999999996, 0.0, 0.001392757660167131)
import matplotlib.pyplot as plt

def plot_mass_weighted_cost_hist(Pi_rand, Pi_nn, Pi_ot, C, bins=100):
    # Flatten costs and plans so each pair (i, j) contributes weight Pi_ij
    C_flat = C.ravel()
    Pr = Pi_rand.ravel()
    Pn = Pi_nn.ravel()
    Po = Pi_ot.ravel()
    plt.figure(figsize=(8, 5))
    plt.hist(C_flat, bins=bins, weights=Pr, alpha=0.5, density=True,
             label="Random baseline")
    plt.hist(C_flat, bins=bins, weights=Pn, alpha=0.5, density=True,
             label="Nearest neighbor")
    plt.hist(C_flat, bins=bins, weights=Po, alpha=0.5, density=True,
             label="Optimal Transport")
    plt.xlabel("Cost $C_{ij}$")
    plt.ylabel("Mass-weighted density")
    plt.title("How Different Matching Methods Allocate Matches Across Costs")
    plt.legend()
    plt.grid(True)
    # --- Interpretation box ---
    text = (
        "Random baseline:\n"
        "• Mass mirrors the underlying cost distribution\n"
        "• No preferential concentration at low cost\n\n"
        "Nearest neighbor:\n"
        "• Mass collapses at minimum cost\n\n"
        "Optimal Transport:\n"
        "• Mass shifted toward low cost\n"
        "• Balanced across the population"
    )
    plt.text(
        0.55, 0.95, text,
        transform=plt.gca().transAxes,
        fontsize=9,
        va="top",
        bbox=dict(boxstyle="round", facecolor="white", alpha=0.9),
    )
    plt.tight_layout()
    plt.show()

plot_mass_weighted_cost_hist(Pi_rand, Pi_nn, Pi, C, bins=100)
plot_mass_weighted_cost_hist(Pi_rand, Pi_nn, Pi, C, bins=100)
This figure shows how different matching methods allocate transport mass across pairwise costs. The random baseline assigns mass independently of cost, so its distribution simply reflects the natural distribution of distances. Nearest-neighbor matching collapses mass onto the lowest-cost pairs, ignoring global balance. Optimal transport shifts mass toward low-cost pairs while maintaining balanced participation across the population.
How evenly does each method distribute participation across women?¶
def plot_participation_boxplot(Pi_rand, Pi_nn, Pi_ot):
    # Column sums = total mass received by each woman under each plan
    r_rand = Pi_rand.sum(axis=0)
    r_nn = Pi_nn.sum(axis=0)
    r_ot = Pi_ot.sum(axis=0)
    data = [r_rand, r_nn, r_ot]
    labels = ["Random baseline", "Nearest neighbor", "Optimal Transport"]
    plt.figure(figsize=(7, 5))
    plt.boxplot(
        data,
        labels=labels,
        showfliers=True,
        whis=(5, 95)
    )
    plt.ylabel("Received mass per woman")
    plt.title("Participation Distribution Across Women")
    plt.grid(True, axis="y")
    plt.tight_layout()
    plt.show()

plot_participation_boxplot(Pi_rand, Pi_nn, Pi)
plot_participation_boxplot(Pi_rand, Pi_nn, Pi)
The box plot summarizes how participation is distributed across individuals. Nearest-neighbor matching produces extreme imbalance, with many individuals receiving no mass, while optimal transport tightly concentrates participation around the target level.
Convex Optimization
Chapter 1: Introduction and Overview¶
Optimization is at the heart of most machine-learning methods. Whether training a linear model or a deep neural network, learning usually means adjusting parameters to minimize a loss that measures how well the model fits the data. Convex optimization is a particularly important and well-understood part of optimization. When both the objective and the constraints are convex, the problem has helpful properties:
- No bad local minima: any local minimum is also the global minimum.
- Predictable behavior: algorithms like gradient descent have clear and well-studied convergence.
- Solutions are easy to verify: convex problems come with simple mathematical conditions that tell us when we have reached the optimum.
Motivation: Optimization in Machine Learning¶
Many supervised learning problems can be written in a common form:
$$ \min_{x \in \mathcal{X}} \ \sum_{i=1}^{n} \ell\big(a_i^\top x,\ b_i\big) + R(x), $$
where
- \(\ell(\cdot,\cdot)\) is a loss function that measures how well the model predicts \(b_i\) from \(a_i\),
- \(R(x)\) is a regularizer that encourages certain structure (such as sparsity or small weights),
- \(\mathcal{X}\) is a set of allowed parameter values, often simple and convex.
Many widely used losses and regularizers are convex. Examples include least squares, logistic loss, hinge loss, Huber loss, the \(\ell_1\) norm, and the \(\ell_2\) norm. Convexity is what makes these problems tractable and allows them to be solved efficiently at scale using well-behaved optimization algorithms.
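As a quick sanity check, one can verify the midpoint convexity inequality \(f\big(\tfrac{u+v}{2}\big) \le \tfrac{f(u)+f(v)}{2}\) numerically for a few of these losses. This is an illustrative sketch on random points (synthetic data, not a proof):

```python
import numpy as np

# Midpoint convexity check for some common convex losses (as functions of the margin z)
losses = {
    "squared":  lambda z: z ** 2,
    "logistic": lambda z: np.log1p(np.exp(-z)),
    "hinge":    lambda z: np.maximum(0.0, 1.0 - z),
}

rng = np.random.default_rng(0)
u, v = rng.normal(size=1000), rng.normal(size=1000)

# For each loss, check f((u+v)/2) <= (f(u)+f(v))/2 pointwise (small tolerance for rounding)
ok = {name: bool(np.all(f((u + v) / 2) <= (f(u) + f(v)) / 2 + 1e-12))
      for name, f in losses.items()}
```

All three inequalities hold at every sampled pair, consistent with the convexity of these losses.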
Why Convex Optimization Remains Central in ML¶
Although many modern models are nonconvex, convex optimization continues to play a major role:
- Convex surrogate losses: Losses such as logistic, hinge, and Huber are convex substitutes for harder objectives like the \(0\text{–}1\) loss. They make optimization practical while still leading to models that generalize well.
- Convex subproblems inside larger algorithms: Many nonconvex methods solve convex problems as part of their inner loop. Examples include least-squares steps in matrix factorization, proximal updates in regularized learning, and simple convex problems that appear in line-search procedures.
These roles make convex optimization a key component of modern ML toolkits, even when the main model is nonconvex.
Web-Book Roadmap and How to Use It¶
- Chapter 2: Linear Algebra Foundations. Basic vector/matrix operations and linear algebra needed for optimization.
- Chapter 3: Multivariable Calculus. Differentiation and derivatives of functions of many variables (gradients, Hessians).
- Chapter 4: Convex Sets and Geometry. Definitions and examples of convex sets, cones, affine spaces, and geometric properties.
- Chapter 5: Convex Functions. Convexity for functions, epigraphs, and key examples (norms, quadratic functions, log-sum-exp, etc.).
- Chapter 6: Nonsmooth Optimization – Subgradients. Generalized derivatives for convex functions that are not differentiable, and subgradient methods.
- Chapter 7: First-Order Optimality Conditions. Gradient-based optimality for smooth problems, supporting theory for necessary and sufficient conditions.
- Chapter 8: Optimization Principles – From Gradient Descent to KKT. Unconstrained and constrained gradient methods, culminating in the Karush–Kuhn–Tucker (KKT) conditions.
- Chapter 9: Lagrange Duality Theory. Duality principles, weak/strong duality, and interpretations of Lagrange multipliers.
- Chapter 10: Pareto Optimality and Multi-Objective Optimization. Trade-offs in optimizing multiple goals and Pareto efficiency.
- Chapter 11: Regularized Approximation. Balancing fit vs. complexity with regularization (ℓ₁, ℓ₂, elastic net, etc.).
- Chapter 12: Algorithms for Convex Optimization. General convex optimization solvers and algorithmic frameworks (interior-point, gradient methods, etc.).
- Chapter 13: Equality-Constrained Problems. Specialized methods (e.g., KKT with only equalities, reduced-space methods).
- Chapter 14: Inequality-Constrained Problems. Algorithms handling general inequality constraints, barrier methods.
- Chapter 15: Advanced Large-Scale and Structured Methods. Techniques for very large or structured problems (decomposition, coordinate descent, etc.).
- Chapter 16: Modeling Patterns and Algorithm Selection. Practical guidance on modeling choices and selecting the right algorithm in practice.
- Chapter 17: Canonical Problems in Convex Optimization. Well-known problem templates (linear, quadratic, SOCP, SDP) and how to recognize them.
- Chapter 18: Modern Optimizers in Machine Learning Frameworks. How convex optimization appears in ML libraries and frameworks.
- Chapter 19: Beyond Convexity – Nonconvex and Global Optimization. Overview of nonconvex problems and global methods (to contrast with convex theory).
- Chapter 20: Derivative-Free and Black-Box Optimization. Techniques when gradients are not available.
- Chapter 21: Metaheuristic and Evolutionary Optimization. Heuristic algorithms (genetic algorithms, simulated annealing) for hard problems.
- Chapter 22: Advanced Topics in Combinatorial Optimization. Combinatorial optimization problems and convex relaxations.
This roadmap helps the reader see how the material progresses from foundations (Ch.2–5) to theory (Ch.6–11) to algorithms (Ch.12–15) and on to specialized and modern topics (Ch.16–22).
Acknowledgments¶
The content and structure of this web book are strongly informed by the Stanford University course EE364A: Convex Optimization I, taught by Stephen Boyd.
Chapter 2: Linear Algebra Foundations¶
Linear algebra provides the geometric language of convex optimization. Many optimization problems in machine learning can be understood as asking how vectors, subspaces, and linear maps relate to one another. This chapter develops the linear-algebra tools that appear throughout convex optimization and machine learning. We focus on geometric ideas: projections, subspaces, orthogonality, eigenvalues, singular values, and norms because these ideas directly shape how optimization behaves.
Vector spaces, subspaces, and affine sets¶
A vector space over \(\mathbb{R}\) is a set of vectors that can be added and scaled without leaving the set. The familiar example is \(\mathbb{R}^n\), where operations like \(\alpha x + \beta y\) keep us within the same space.
Within a vector space, some subsets behave particularly nicely. A subspace is a subset that is itself a vector space: it is closed under addition, closed under scalar multiplication, and contains the zero vector. Geometrically, subspaces are “flat” objects that always pass through the origin, such as lines or planes in \(\mathbb{R}^3\).
Affine sets extend this idea by allowing a shift away from the origin. A set \(A\) is affine if it contains the entire line passing through any two of its points. Equivalently, for any \(x,y \in A\) and any \(\theta \in \mathbb{R}\), \(\theta x + (1 - \theta) y \in A.\) That is, the entire line passing through any two points in \(A\) lies within \(A\). By contrast, a convex set only requires this property for \(\theta \in [0,1]\), meaning only the line segment between \(x\) and \(y\) must lie within the set.
Affine sets look like translated subspaces: lines or planes that do not need to pass through the origin. Every affine set can be written as: \(A = x_0 + S = \{\, x_0 + s : s \in S \,\},\) where \(S\) is a subspace and \(x_0\) is any point in the set. This representation is extremely useful in optimization. If \(Ax = b\) is a linear constraint, then its solution set is an affine set. A single particular solution \(x_0\) gives one point satisfying the constraint, and the entire solution set is obtained by adding the null space of \(A\). Thus, optimization under linear constraints means searching over an affine set determined by the constraint structure.
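This decomposition is easy to see numerically. Below is a small sketch with a made-up constraint \(Ax = b\): one particular solution \(x_0\), plus any direction in the null space of \(A\), stays feasible.

```python
import numpy as np
from scipy.linalg import null_space

# Made-up linear constraint Ax = b (2 equations, 3 unknowns)
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([6.0, 2.0])

x0, *_ = np.linalg.lstsq(A, b, rcond=None)   # one particular solution
N = null_space(A)                            # orthonormal basis of null(A)

# Any point x0 + N t lies in the same affine solution set
t = np.array([3.7])                          # arbitrary coefficient
x = x0 + N @ t
```

Both `x0` and `x` satisfy the constraint exactly, illustrating that the solution set is the translated subspace \(x_0 + \mathcal{N}(A)\).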
Affine Transformations: An affine transformation (or affine map) is a function \(f : V \to W\) that can be written as \(f(x) = A x + b,\) where \(A\) is a linear map and \(b\) is a fixed vector: a linear transformation followed by a translation. Affine transformations preserve both affinity and convexity: if \(C\) is convex, then \(A C + b\) is also convex, so if a feasible region is convex or affine, applying an affine transformation does not destroy that property. This matters for optimization because many models and algorithms implicitly perform affine transformations, for example when reparameterizing variables, scaling features, or mapping between coordinate systems. Since convexity is preserved under these operations, the essential geometry of the problem remains intact.
Linear combinations, span, basis, dimension¶
Much of linear algebra revolves around understanding how vectors can be combined to generate new vectors. This idea is essential in optimization because gradients, search directions, feasible directions, and model predictions are often built from linear combinations of simpler components.
Given vectors \(v_1,\dots,v_k\), any vector of the form \(\alpha_1 v_1 + \cdots + \alpha_k v_k\), with scalars \(\alpha_1,\dots,\alpha_k \in \mathbb{R}\), is a linear combination. The set of all linear combinations is called the span: \(\mathrm{span}\{v_1,\dots,v_k\} = \{\, \alpha_1 v_1 + \cdots + \alpha_k v_k : \alpha_i \in \mathbb{R} \,\}.\) The span describes the collection of directions that can be reached from these vectors and therefore determines what portion of the ambient space they can represent.
The concept of linear independence formalizes when a set of vectors contains no redundancy. A set of vectors is linearly independent if none of them can be written as a linear combination of the others. If a set is linearly dependent, at least one vector adds no new direction.
- A basis of a space \(V\) is a linearly independent set whose span equals \(V\).
- The number of basis vectors is the dimension \(\dim(V)\).
- The column space of \(A\) is the span of its columns. Its dimension is \(\mathrm{rank}(A)\).
- The nullspace of \(A\) is \(\{ x : Ax = 0 \}\).
- The rank-nullity theorem states: \(\mathrm{rank}(A) + \mathrm{nullity}(A) = n,\) where \(n\) is the number of columns of \(A\).
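These definitions are easy to check numerically. The matrix below, with a deliberately dependent third column, is made up for illustration:

```python
import numpy as np
from scipy.linalg import null_space

# Third column = first column + second column, so the columns are dependent
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

rank = np.linalg.matrix_rank(A)        # dimension of the column space
nullity = null_space(A).shape[1]       # dimension of the null space
n = A.shape[1]                         # number of columns
```

Here `rank` is 2 and `nullity` is 1, and their sum equals the number of columns, as the rank-nullity theorem requires.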
Figure: The Four Fundamental Subspaces (MIT Linear Algebra).
Column Space: The column space of a matrix \(A\), denoted \(\mathcal{R}(A)\), is the set of all possible output vectors that can be written as \(Ax\) for some \(x\). In other words, it contains all vectors that the matrix can “reach” through linear combinations of its columns. The question “Does the system \(Ax = b\) have a solution?” is equivalent to asking whether \(b \in \mathcal{R}(A)\). If \(b\) lies in the column space, a solution exists; otherwise, it does not.
Null Space: The null space (or kernel) of \(A\), denoted \(\mathcal{N}(A)\), is the set of all input vectors that are mapped to zero: \(\mathcal{N}(A) = \{\, x : Ax = 0 \,\}\). It answers a different question: if a solution to \(Ax = b\) exists, is it unique? If the null space contains only the zero vector (\(\mathcal{N}(A) = \{0\}\)), the solution is unique. But if \(\mathcal{N}(A)\) contains nonzero vectors, there are infinitely many distinct solutions that yield the same output.
Multicollinearity: When one feature in the data matrix \(A\) is a linear combination of others, the columns of \(A\) become linearly dependent. This creates a nonzero vector in the null space of \(A\), meaning multiple weight vectors can produce the same predictions. The model is then unidentifiable (underdetermined: the number of unknown parameters exceeds the number of independent equations), and \(A^\top A\) becomes singular (non-invertible). Regularization methods such as Ridge or Lasso regression resolve this ambiguity by selecting one stable, well-behaved solution.
Regularization introduces an additional constraint or penalty that selects a single, stable solution from among the infinite possibilities.
Ridge regression (L2 regularization) adds a penalty on the norm of \(x\): \(\min_x \ \|Ax - b\|_2^2 + \lambda \|x\|_2^2,\) which modifies the normal equations to \((A^\top A + \lambda I)\, x = A^\top b.\) The added term \(\lambda I\) ensures invertibility and numerical stability.
Lasso regression (L1 regularization) instead penalizes \(\|x\|_1\), promoting sparsity by driving some coefficients exactly to zero.
Thus, regularization resolves ambiguity by imposing structure or preference on the solution, favoring smaller or sparser coefficient vectors, and making the regression problem well-posed even when \(A\) is rank-deficient.
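A small sketch of this effect, using a deliberately collinear design matrix (all data here is synthetic): the Gram matrix \(A^\top A\) is singular, while the ridge system is solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
a1 = rng.normal(size=50)
a2 = rng.normal(size=50)
A = np.column_stack([a1, a2, a1 + a2])        # third column = a1 + a2 (collinear)
y = A @ np.array([1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=50)

G = A.T @ A
rank_G = np.linalg.matrix_rank(G)             # 2 < 3: G is singular

# Ridge: add lam * I to make the normal equations invertible
lam = 0.1
x_ridge = np.linalg.solve(G + lam * np.eye(3), A.T @ y)
residual = np.linalg.norm(A @ x_ridge - y)
```

The plain normal equations have no unique solution here, but the ridge-regularized system yields one stable coefficient vector with a small residual.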
Feasible Directions: In a constrained optimization problem of the form \(\min_x f(x)\ \text{s.t.}\ Ax = b\), the null space of \(A\) characterizes the directions along which one can move without violating the constraints. If \(d \in \mathcal{N}(A)\), then moving from a feasible point \(x\) to \(x + d\) preserves feasibility, since \(A(x + d) = Ax + Ad = Ax = b\). Thus, the null space defines the space of free movement directions in which optimization algorithms can explore solutions while remaining within the constraint surface.
Row Space: The row space of \(A\), denoted \(\mathcal{R}(A^\top)\), is the span of the rows of \(A\) (viewed as vectors). It represents all possible linear combinations of the rows and has the same dimension as the column space, equal to \(\mathrm{rank}(A)\). The row space is orthogonal to the null space of \(A\): \(\mathcal{R}(A^\top) \perp \mathcal{N}(A)\). In optimization, the row space corresponds to the set of active constraints, or the directions along which changes in \(x\) affect the constraints.
Left Null Space: The left null space, denoted \(\mathcal{N}(A^\top)\), is the set of all vectors \(y\) such that \(A^\top y = 0\). These vectors are orthogonal to the columns of \(A\), and therefore orthogonal to the column space itself. In least squares problems, \(\mathcal{N}(A^\top)\) represents residual directions: components of \(b\) that cannot be explained by the model \(Ax\).
Projection Interpretation (Least Squares): When \(Ax = b\) has no exact solution (as in overdetermined systems), the least squares solution finds \(x^*\) such that \(Ax^*\) is the projection of \(b\) onto the column space of \(A\):
$$ Ax^* = \mathrm{proj}_{\mathcal{R}(A)}(b), $$
and the residual
$$ r = b - Ax^* $$
lies in the left null space \(\mathcal{N}(A^\top)\).
This provides a geometric view: the solution projects \(b\) onto the closest point in the subspace that \(A\) can reach.
Rank–Nullity Relationship: The rank of \(A\) is the dimension of both its column and row spaces, and the nullity is the dimension of its null space. Together they satisfy the Rank–Nullity Theorem: \(\mathrm{rank}(A) + \mathrm{nullity}(A) = n,\) where \(n\) is the number of columns of \(A\). This theorem reflects the balance between the number of independent constraints and the number of degrees of freedom in \(x\).
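The orthogonality of the least-squares residual to the column space is easy to verify numerically (random synthetic data):

```python
import numpy as np

# Overdetermined system: 20 equations, 3 unknowns (synthetic data)
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solution
r = b - A @ x_star                               # residual, in the left null space
orth = A.T @ r                                   # should be (numerically) zero
```

Since `A.T @ r` vanishes, the residual is orthogonal to every column of \(A\): the fitted vector \(Ax^*\) really is the projection of \(b\) onto the column space.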
Geometric Interpretation:
- The column space represents all reachable outputs \(Ax\).
- The null space represents all indistinguishable inputs that map to zero.
- The row space represents all independent constraints imposed by \(A\).
- The left null space captures inconsistencies or residual directions that cannot be explained by the model.
Together, these four subspaces define the complete geometry of the linear map \(A\).
Inner products and orthogonality¶
Inner products provide the geometric structure that underlies most optimization algorithms. They allow us to define lengths, angles, projections, gradients, and orthogonality. An inner product on \(\mathbb{R}^n\) is a map \(\langle \cdot,\cdot\rangle : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}\) such that for all \(x,y,z\) and all scalars \(\alpha\):
- \(\langle x,y \rangle = \langle y,x\rangle\) (symmetry),
- \(\langle x+y,z \rangle = \langle x,z \rangle + \langle y,z\rangle\) (linearity in first argument),
- \(\langle \alpha x, y\rangle = \alpha \langle x, y\rangle\),
- \(\langle x, x\rangle \ge 0\) with equality iff \(x=0\) (positive definiteness).
The inner product induces:
- length (norm): \(\|x\|_2 = \sqrt{\langle x,x\rangle}\),
- angle: \(\cos\theta = \dfrac{\langle x,y\rangle}{\|x\|_2\,\|y\|_2}\) for nonzero \(x, y\).
Two vectors are orthogonal if \(\langle x,y\rangle = 0\). A set of vectors \(\{v_i\}\) is orthonormal if each \(\|v_i\| = 1\) and \(\langle v_i, v_j\rangle = 0\) for \(i\ne j\).
More generally, an inner product endows \(V\) with a geometric structure, turning it into an inner product space (and if complete, a Hilbert space). Inner products allow us to talk about orthogonality (perpendicular vectors) and orthogonal projections, and to define the all-important concept of a gradient in optimization.
Geometry from the inner product: An inner product induces a norm \(\|x\| = \sqrt{\langle x,x \rangle}\) and a notion of distance \(d(x,y) = \|x-y\|\). It also defines angles: \(\langle x,y \rangle = 0\) means \(x\) and \(y\) are orthogonal. Thus, inner products generalize the geometric concepts of lengths and angles to abstract vector spaces. Many results in Euclidean geometry (like the Pythagorean theorem and law of cosines) hold in any inner product space. For example, the parallelogram law holds: \(\|x+y\|^2 + \|x-y\|^2 = 2\|x\|^2 + 2\|y\|^2\).
The Cauchy–Schwarz inequality: For any \(x,y \in \mathbb{R}^n\): \(|\langle x,y\rangle| \le \|x\|\|y\|,\) with equality iff \(x\) and \(y\) are linearly dependent. Geometrically, it means the absolute inner product is maximized when \(x\) and \(y\) point in the same or opposite direction.
Examples of inner products:
- Standard (Euclidean) inner product: \(\langle x,y\rangle = x^\top y = \sum_i x_i y_i\). This underlies most optimization algorithms on \(\mathbb{R}^n\), where \(\nabla f(x)\) is defined via this inner product (so that \(\langle \nabla f(x), h\rangle\) gives the directional derivative in direction \(h\)).
- Weighted inner product: \(\langle x,y\rangle_W = x^\top W y\) for some symmetric positive-definite matrix \(W\). Here \(\|x\|_W = \sqrt{x^\top W x}\) is a weighted length. Such inner products appear in preconditioning: by choosing \(W\) cleverly, one can measure distances in a way that accounts for scaling in the problem (e.g. the Mahalanobis distance uses \(W = \Sigma^{-1}\) for covariance \(\Sigma\)).
- Function space inner product: \(\langle f, g \rangle = \int_a^b f(t)\,g(t)\,dt\). This turns the space of square-integrable functions on \([a,b]\) into an inner product space (a Hilbert space, \(L^2[a,b]\)). In machine learning, this is the basis for kernel Hilbert spaces, where one defines an inner product between functions to lift optimization into infinite-dimensional feature spaces.
Applications in optimization: Inner product geometry is indispensable in convex optimization.
- Gradients: The gradient \(\nabla f(x)\) is defined as the vector satisfying \(f(x+h)\approx f(x) + \langle \nabla f(x), h\rangle\). Thus the inner product induces the notion of steepest ascent/descent direction (steepest descent is in direction \(-\nabla f(x)\) because it minimizes the inner product with the gradient). If we changed the inner product (using a matrix \(W\)), the notion of gradient would change accordingly (this idea is used in natural gradient methods).
- Orthogonal projections: Many algorithms require projecting onto a constraint set. For linear constraints \(Ax=b\) (an affine set), the projection formula uses the inner product to find the closest point in the affine set. Projections also underpin least squares problems (the solution is the projection of \(b\) onto \(\mathrm{range}(A)\)) and quadratic programs (where each iteration might involve a projection).
- Orthonormal representations: Orthonormal bases (like principal components) simplify optimization by diagonalizing quadratic forms or separating variables. For instance, in PCA we use an orthonormal basis (eigenvectors) to reduce dimensionality. In iterative algorithms, working in an orthonormal basis aligned with the problem (e.g. preconditioning) can accelerate convergence.
- Conditioning and Gram matrix: The inner product concept leads to the Gram matrix \(G_{ij} = \langle x_i, x_j\rangle\) for a set of vectors. In machine learning, the Gram matrix (or kernel matrix) encodes similarity of features and appears in the normal equations for least squares: \(X^\top X\) is a Gram matrix whose eigenvalues tell us about problem conditioning. A well-conditioned Gram matrix (no tiny eigenvalues) means the problem is nicely scaled for gradient descent, whereas ill-conditioning (some nearly zero eigenvalues) means there are directions in weight space that are very flat, slowing convergence. Techniques like feature scaling or adding regularization (ridge regression) improve the Gram matrix’s condition number and thus algorithm performance.
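A small illustration of this last point, with synthetic features where one column is badly scaled (all data made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] *= 1000.0                          # one badly scaled feature

kappa_raw = np.linalg.cond(X.T @ X)        # condition number of the Gram matrix

Xs = X / X.std(axis=0)                     # standardize each column
kappa_scaled = np.linalg.cond(Xs.T @ Xs)   # far smaller after scaling
```

Standardizing the columns shrinks the Gram matrix's condition number by many orders of magnitude, which is exactly why feature scaling speeds up gradient descent on least-squares-type objectives.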
Norms and distances¶
A function \(\|\cdot\|: \mathbb{R}^n \to \mathbb{R}\) is a norm if for all \(x,y\) and scalar \(\alpha\):
- \(\|x\| \ge 0\) and \(\|x\| = 0 \iff x=0\),
- \(\|\alpha x\| = |\alpha|\|x\|\) (absolute homogeneity),
- \(\|x+y\| \le \|x\| + \|y\|\) (triangle inequality).
If the vector space has an inner product, the norm \(\|x\| = \sqrt{\langle x,x\rangle}\) is called the Euclidean norm (or 2-norm). But many other norms exist, each defining a different geometry.
Common examples on \(\mathbb{R}^n\):
- \(\ell_2\) norm (Euclidean): \(\|x\|_2 = \sqrt{\sum_i x_i^2}\), the usual length in space.
- \(\ell_1\) norm: \(\|x\|_1 = \sum_i |x_i|\), measuring taxicab distance. In \(\mathbb{R}^2\), its unit ball is a diamond.
- \(\ell_\infty\) norm: \(\|x\|_\infty = \max_i |x_i|\), measuring the largest coordinate magnitude. Its unit ball in \(\mathbb{R}^2\) is a square.
- General \(\ell_p\) norm: \(\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}\) for \(p\ge1\). This interpolates between \(\ell_1\) and \(\ell_2\), and approaches \(\ell_\infty\) as \(p\to\infty\). All \(\ell_p\) norms are convex and satisfy the norm axioms.
Every norm induces a metric (distance) \(d(x,y) = \|x-y\|\) on the space. Norms thus define the shape of “balls” (sets \(\{x : \|x\| \le c\}\)) and how we measure closeness. The choice of norm can significantly influence an optimization algorithm’s behavior: it affects which steps are considered small, which directions are easy to move in, and how convergence is assessed.
Unit-ball geometry: The shape of the unit ball \(\{x : \|x\| \le 1\}\) reveals how a norm treats different directions. For example, the \(\ell_2\) unit ball in \(\mathbb{R}^2\) is a perfect circle, treating all directions uniformly, whereas the \(\ell_1\) unit ball is a diamond with corners along the axes, indicating that \(\ell_1\) treats the coordinate axes as special (those are “cheaper” directions since the ball extends further along the axes, touching them at \((\pm1,0)\) and \((0,\pm1)\)). The \(\ell_\infty\) unit ball is a square aligned with the axes, suggesting it allows more combined motion in coordinates as long as no single coordinate exceeds the limit. In the standard picture, the \(\ell_1\) ball is a diamond, the \(\ell_2\) ball a circle, and the \(\ell_\infty\) ball a square in \(\mathbb{R}^2\). The geometry of the unit ball matters whenever we regularize or constrain solutions by a norm. For instance, using an \(\ell_1\) norm ball as a constraint or regularizer encourages solutions on the corners (sparse solutions), while an \(\ell_2\) ball encourages more evenly distributed changes. An \(\ell_\infty\) constraint limits the maximum absolute value of any component, leading to solutions that avoid any single large entry.
Norms in optimization algorithms: Different norms define different algorithmic behaviors. For example, gradient descent typically uses the Euclidean norm for step sizes and convergence analysis, but coordinate descent methods implicitly use \(\ell_\infty\) (since one coordinate move at a time is like a step in the \(\ell_\infty\) unit ball). Mirror descent methods use non-Euclidean norms and their duals to get better performance on certain problems (e.g. using the \(\ell_1\) norm for sparse problems). The norm also figures in complexity bounds: an algorithm’s convergence rate may depend on the diameter of the feasible set in the chosen norm, \(D = \max_{x\ \text{feasible}} \|x - x^*\|\). For instance, in subgradient methods, having a smaller \(\ell_2\) or \(\ell_1\) diameter can improve bounds. Moreover, when constraints are given by norms (like \(\|x\|_1 \le t\)), projections and proximal operators with respect to that norm become subroutines in algorithms.
In summary, norms provide the metric backbone of optimization. They tell us how to measure progress (\(\|x_k - x^*\|\)), how to constrain solutions (\(\|x\| \le R\)), and how to bound errors. The choice of norm can induce sparsity, robustness, or other desired structure in solutions.
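The standard NumPy calls for these norms, on a small example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])

l1 = np.linalg.norm(x, 1)          # sum of absolute values: 8
l2 = np.linalg.norm(x, 2)          # Euclidean length: sqrt(26)
linf = np.linalg.norm(x, np.inf)   # largest absolute coordinate: 4

# On R^n these always satisfy ||x||_inf <= ||x||_2 <= ||x||_1
```

The ordering \(\|x\|_\infty \le \|x\|_2 \le \|x\|_1\) holds for every vector, which is one reason bounds stated in one norm transfer (with dimension-dependent constants) to another.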
Eigenvalues, eigenvectors, and positive semidefinite matrices¶
If \(A \in \mathbb{R}^{n\times n}\), a nonzero \(v\) is an eigenvector with eigenvalue \(\lambda\) if \(Av = \lambda v.\)
When \(A\) is symmetric (\(A = A^\top\)), it has:
- real eigenvalues,
- an orthonormal eigenbasis,
- a spectral decomposition \(A = Q \Lambda Q^\top,\)
where \(Q\) is orthogonal (its columns are the orthonormal eigenvectors) and \(\Lambda\) is diagonal (containing the eigenvalues).
This is the spectral decomposition. Geometrically, a symmetric matrix acts as a scaling along \(n\) orthogonal principal directions (its eigenvectors), stretching or flipping by factors given by \(\lambda_i\).
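In NumPy, `np.linalg.eigh` computes exactly this decomposition for a symmetric matrix; a quick check on a toy matrix:

```python
import numpy as np

# Symmetric toy matrix; eigh returns eigenvalues in ascending order
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, Q = np.linalg.eigh(A)             # spectral decomposition A = Q diag(lam) Q^T
A_rebuilt = Q @ np.diag(lam) @ Q.T     # reassemble A from its spectrum
```

The eigenvalues here are 1 and 3, the columns of `Q` are orthonormal, and the product reconstructs `A` exactly.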
When dealing specifically with square matrices and quadratic forms (like Hessians of twice-differentiable functions), eigenvalues become central. They describe how a symmetric matrix scales vectors in different directions. Many convex optimization conditions involve requiring a matrix (Hessian or constraint curvature matrix) to be positive semidefinite, which is an eigenvalue condition.
In optimization, the Hessian matrix of a multivariate function \(f(x)\) is symmetric. Its eigenvalues \(\lambda_i(\nabla^2 f(x))\) tell us the curvature along principal axes. If all eigenvalues are positive at a point, the function curves up in all directions (a local minimum if gradient is zero); if any eigenvalue is negative, there’s a direction of negative curvature (a saddle or maximum). So checking eigenvalues of Hessian is a way to test convexity/concavity locally.
Positive semidefinite matrices: A symmetric matrix \(Q\) is positive semidefinite (PSD) if \(x^\top Q x \ge 0\) for all \(x \in \mathbb{R}^n.\)
If \(x^\top Q x > 0\) for all \(x\ne 0\), then \(Q\) is positive definite (PD).
Why this matters: if \(f(x) = \tfrac{1}{2} x^\top Q x + c^\top x + d\), then \(\nabla^2 f(x) = Q.\)
So \(f\) is convex iff \(Q\) is PSD.
Implications of definiteness: If \(A \succ 0\), the quadratic function \(x^T A x\) is strictly convex and has a unique minimizer at \(x=0\). If \(A \succeq 0\), \(x^T A x\) is convex but could be flat in some directions (if some \(\lambda_i = 0\), those eigenvectors lie in the nullspace and the form is constant along them). In optimization, PD Hessian \(\nabla^2 f(x) \succ 0\) means \(f\) has a unique local (and global, if domain convex) minimum at that \(x\) (since the second-order condition for optimality is satisfied strictly). PD constraint matrices in quadratic programs ensure nice properties like Slater’s condition for strong duality.
Condition number and convergence: For iterative methods on convex quadratics \(f(x) = \frac{1}{2}x^T Q x - b^T x\), the eigenvalues of \(Q\) dictate convergence speed. Gradient descent’s error after \(k\) steps satisfies roughly \(\|x_k - x^*\| \le \big(\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}\big)^k \|x_0 - x^*\|\) (for the optimal fixed step). So the ratio \(\frac{\lambda_{\max}}{\lambda_{\min}} = \kappa(Q)\) appears: closer to 1 (well-conditioned) means rapid convergence; large ratio (ill-conditioned) means slow, zigzagging progress. Newton’s method uses the Hessian inverse, effectively rescaling all eigenvalues to 1, so its performance is invariant to \(\kappa\) (locally). This explains why second-order methods shine on ill-conditioned problems: they “whiten” the curvature by dividing by eigenvalues.
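The effect of conditioning can be seen directly; a minimal numpy sketch (eigenvalue choices and step count are illustrative) runs gradient descent with the classical fixed step \(\alpha = 2/(\lambda_{\max}+\lambda_{\min})\) on two diagonal quadratics:

```python
import numpy as np

# Gradient descent on f(x) = 1/2 x^T Q x with the classical fixed step
# alpha = 2/(lmax + lmin); the error contracts by (kappa-1)/(kappa+1) per step.
def gd_error(eigs, steps=50):
    Q = np.diag(eigs)
    x = np.ones(len(eigs))            # x* = 0, so the error is ||x||
    alpha = 2.0 / (eigs.max() + eigs.min())
    for _ in range(steps):
        x = x - alpha * (Q @ x)       # gradient of 1/2 x^T Q x is Q x
    return np.linalg.norm(x)

err_good = gd_error(np.array([1.0, 2.0]))     # kappa = 2
err_bad = gd_error(np.array([1.0, 100.0]))    # kappa = 100
assert err_good < 1e-10   # essentially converged
assert err_bad > 0.1      # still far from the optimum after 50 steps
```

With \(\kappa = 2\) the error contracts by \(1/3\) per step; with \(\kappa = 100\) the contraction factor is \(99/101 \approx 0.98\), so progress is dramatically slower.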
Optimization interpretation of eigenvectors: The eigenvectors of \(\nabla^2 f(x^*)\) at the optimum indicate the principal axes of the local quadratic approximation. Directions with small eigenvalues are flat directions where the function changes slowly (possibly requiring large steps unless Newton’s method is used). Directions with large eigenvalues are steep, potentially requiring small step sizes to maintain stability if using gradient descent. Preconditioning or a change of variables often aims to transform the problem so that in the new coordinates the Hessian is closer to the identity (all eigenvalues near 1).
Orthogonal projections and least squares¶
Let \(S\) be a subspace of \(\mathbb{R}^n\). The orthogonal projection of a vector \(b\) onto \(S\) is the unique vector \(p \in S\) minimising \(\|b - p\|_2\). Geometrically, \(p\) is the closest point in \(S\) to \(b\). If \(S = \mathrm{span}\{a_1,\dots,a_k\}\) and \(A = [a_1~\cdots~a_k]\), then projecting \(b\) onto \(S\) is equivalent to solving the least-squares problem
\[\min_{x \in \mathbb{R}^k} \; \|Ax - b\|_2^2.\]
The solution \(x^*\) satisfies the normal equations
\[A^\top A \, x^* = A^\top b, \qquad p = A x^*.\]
This is our first real convex optimisation problem:
- the objective \(\|Ax-b\|_2^2\) is convex,
- there are no constraints,
- we can solve it in closed form.
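A quick numpy check (random data, purely illustrative) that the normal equations produce the projection, with the residual orthogonal to the column space, and agree with a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # full column rank with probability 1
b = rng.standard_normal(6)

# Solve the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
p = A @ x_star                    # projection of b onto C(A)

# The residual b - p is orthogonal to the column space of A
assert np.allclose(A.T @ (b - p), 0)

# Agrees with the library least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_star, x_lstsq)
```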
Operator norms, singular values, and spectral structure¶
Many aspects of optimization depend on how a matrix transforms vectors: how much it stretches them, in which directions it amplifies or shrinks signals, and how sensitive it is to perturbations. Operator norms and singular values provide the tools to quantify these behaviors.
Operator norms¶
Given a matrix \(A : \mathbb{R}^n \to \mathbb{R}^m\) and norms \(\|\cdot\|_p\) on \(\mathbb{R}^n\) and \(\|\cdot\|_q\) on \(\mathbb{R}^m\), the induced operator norm is defined as
\[\|A\|_{p \to q} = \sup_{x \ne 0} \frac{\|Ax\|_q}{\|x\|_p}.\]
This quantity measures the largest amount by which \(A\) can magnify a vector when measured with the chosen norms. Several important special cases are widely used:
- \(\|A\|_{2 \to 2}\), the spectral norm, equals the largest singular value of \(A\).
- \(\|A\|_{1 \to 1}\) is the maximum absolute column sum.
- \(\|A\|_{\infty \to \infty}\) is the maximum absolute row sum.
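These three identities are easy to confirm numerically; a small sketch with an arbitrary matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [0.0,  4.0, -1.0]])

# Spectral norm = largest singular value
assert np.isclose(np.linalg.norm(A, 2), np.linalg.svd(A, compute_uv=False)[0])
# 1 -> 1 norm = maximum absolute column sum
assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
# inf -> inf norm = maximum absolute row sum
assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())
```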
In optimization, operator norms play a central role in determining stability. For example, gradient descent on the quadratic function
\[f(x) = \tfrac{1}{2} x^\top Q x - b^\top x, \qquad Q \succeq 0,\]
converges for step sizes \(\alpha < 2 / \|Q\|_2\). This shows that controlling the operator norm of the Hessian (or a Lipschitz constant of the gradient) directly governs how aggressively an algorithm can move.
Singular Value Decomposition (SVD)¶
Any matrix \(A \in \mathbb{R}^{m \times n}\) admits a factorization
\[A = U \Sigma V^\top,\]
where \(U\) and \(V\) are orthogonal matrices and \(\Sigma\) is diagonal with nonnegative entries \(\sigma_1 \ge \sigma_2 \ge \cdots\). The \(\sigma_i\) are the singular values of \(A\).
Geometrically, the SVD shows how \(A\) transforms the unit ball into an ellipsoid. The columns of \(V\) give the principal input directions, the singular values are the lengths of the ellipsoid’s axes, and the columns of \(U\) give the output directions. The largest singular value \(\sigma_{\max}\) equals the spectral norm \(\|A\|_2\), while the smallest \(\sigma_{\min}\) describes the least expansion (or exact flattening if \(\sigma_{\min} = 0\)).
SVD is a powerful diagnostic tool in optimization. The ratio
\[\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}\]
is the condition number of \(A\). A large condition number implies that the map stretches some directions much more than others, leading to slow or unstable convergence in gradient methods. A small condition number means \(A\) behaves more like a uniform scaling, which is ideal for optimization.
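A short numerical illustration (random square matrix, arbitrary seed) connecting singular values, the spectral norm, and the condition number:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
U, s, Vt = np.linalg.svd(A)

# sigma_max equals the spectral norm; singular values squared are
# the eigenvalues of A^T A
assert np.isclose(s[0], np.linalg.norm(A, 2))
assert np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A)))

# Condition number sigma_max / sigma_min
kappa = s[0] / s[-1]
assert np.isclose(kappa, np.linalg.cond(A, 2))
```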
Low-rank structure¶
The rank of \(A\) is the number of nonzero singular values. When \(A\) has low rank, it effectively acts on a lower-dimensional subspace. This structure can be exploited in optimization: low-rank matrices enable dimensionality reduction, fast matrix-vector products, and compact representations. In machine learning, truncated SVD is used for PCA, feature compression, and approximating large linear operators.
Mental Map¶
Linear Algebra Foundations for Convex Optimization
Geometry + Computation for Understanding Algorithms
│
▼
Objects: Vectors, Matrices, Maps
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Vector Spaces / Subspaces / Affine Sets │
│ - Feasible sets of Ax=b are affine: x = x0 + N(A) │
│ - Feasible directions live in the null space │
└───────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ The Four Fundamental Subspaces of A │
│ - Column space C(A): reachable outputs (Ax) │
│ - Null space N(A): indistinguishable inputs (Ax=0) │
│ - Row space R(A): constraint directions in x-space │
│ - Left null space N(Aᵀ): residual directions (orthogonal to C) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Inner Products → Orthogonality → Projections │
│ - Defines angles/lengths │
│ - Least squares = projection of b onto C(A) │
│ - QR / Gram–Schmidt give stable numerical tools │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Norms & Dual Norms: "How we measure size" │
│ - Unit balls define geometry of constraints/regularizers │
│ - Dual norms bound dot products and appear in optimality │
└────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Spectral Structure: Eigenvalues, PSD, SVD │
│ - PSD ⇔ convex quadratics (Hessians of quadratic objectives)│
│ - SVD shows stretching directions and conditioning │
│ - Condition number ↔ convergence speed / numerical stability │
└───────────────────────────────────────────────────────────────┘
Chapter 3: Multivariable Calculus for Optimization¶
Optimization problems are ultimately questions about how a function changes when we move in different directions. To understand this behavior, we rely on multivariable calculus. Concepts such as gradients, Jacobians, Hessians, and Taylor expansions describe how a real-valued function behaves locally and how its value varies as we adjust its inputs.
These tools form the analytical backbone of modern optimization. Gradients determine descent directions and guide first-order algorithms such as gradient descent and stochastic gradient methods. Hessians quantify curvature and enable second-order methods like Newton’s method, which adapt their steps to the shape of the objective. Jacobians and chain rules underpin backpropagation in neural networks, linking calculus to large-scale machine learning practice.
This chapter develops the differential calculus needed for convex analysis and for understanding why many optimization algorithms work.
Gradients and Directional Derivatives¶
Let \(f : \mathbb{R}^n \to \mathbb{R}\). The function is differentiable at a point \(x\) if there exists a vector \(\nabla f(x)\) such that
\[f(x + h) = f(x) + \nabla f(x)^\top h + o(\|h\|),\]
meaning that the linear function \(h \mapsto \nabla f(x)^\top h\) provides the best local approximation to \(f\) near \(x\). The gradient is the unique vector with this property.
A closely related concept is the directional derivative. For any direction \(v \in \mathbb{R}^n\), the directional derivative of \(f\) at \(x\) in the direction \(v\) is
\[D_v f(x) = \lim_{t \to 0^+} \frac{f(x + t v) - f(x)}{t}.\]
If \(f\) is differentiable, then
\[D_v f(x) = \nabla f(x)^\top v.\]
Thus, the gradient encodes all directional derivatives simultaneously: its inner product with a direction \(v\) tells us how rapidly \(f\) increases when we move infinitesimally along \(v\).
This immediately yields an important geometric fact. Among all unit directions \(u\), the directional derivative
\[D_u f(x) = \nabla f(x)^\top u\]
is maximized when \(u\) points in the direction of \(\nabla f(x)\), the direction of steepest ascent. The steepest descent direction is therefore \(-\nabla f(x)\), which motivates gradient-descent algorithms for minimizing functions.
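A finite-difference sketch (the function \(f(x) = x_1^2 + 3x_2^2\) is an illustrative choice) confirming both facts: the gradient reproduces directional derivatives, and the normalized gradient maximizes them over unit directions:

```python
import numpy as np

# f(x) = x1^2 + 3 x2^2 with gradient (2 x1, 6 x2)
f = lambda x: x[0]**2 + 3 * x[1]**2
grad = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
v = np.array([0.6, 0.8])                # a unit direction
fd = (f(x + 1e-6 * v) - f(x)) / 1e-6    # finite-difference directional derivative
assert abs(fd - grad(x) @ v) < 1e-4     # matches grad(x)^T v

# Among unit directions, the normalized gradient maximizes grad(x)^T u
g = grad(x) / np.linalg.norm(grad(x))
rng = np.random.default_rng(0)
for _ in range(100):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)
    assert grad(x) @ u <= grad(x) @ g + 1e-12
```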
For any real number \(c\), the level set of \(f\) is
\[L_c = \{ x \in \mathbb{R}^n : f(x) = c \}.\]
At any point \(x\) with \(\nabla f(x) \ne 0\), the gradient \(\nabla f(x)\) is orthogonal to the level set \(L_{f(x)}\). Geometrically, level sets are like contour lines on a topographic map, and the gradient points perpendicular to them — in the direction of the steepest ascent of \(f\). If we wish to decrease \(f\), we move roughly in the opposite direction, \(-\nabla f(x)\) (the direction of steepest descent). This geometric fact becomes central in constrained optimization: at optimality, the gradient of the objective lies in the span of gradients of active constraints.
Jacobians¶
In optimization and machine learning, functions often map many inputs to many outputs: neural network layers, physical simulators, and vector-valued transformations, for example. To understand how such functions change locally, we use the Jacobian matrix, which captures how each output responds to each input.
From derivative to gradient¶
For a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\), differentiability means that near any point \(x\),
\[f(x + h) \approx f(x) + \nabla f(x)^\top h.\]
The gradient vector
\[\nabla f(x) = \Big( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \Big)^{\top}\]
collects all partial derivatives. Each component measures how sensitive \(f\) is to changes in a single coordinate. Together, the gradient points in the direction of steepest increase, and its norm indicates how rapidly the function rises.
From gradient to Jacobian¶
Now consider a vector-valued function \(F : \mathbb{R}^n \to \mathbb{R}^m\),
\[F(x) = \big( F_1(x), \dots, F_m(x) \big).\]
Each output \(F_i\) has its own gradient. Stacking these row vectors yields the Jacobian matrix:
\[J_F(x) = \begin{bmatrix} \dfrac{\partial F_1}{\partial x_1} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial F_m}{\partial x_1} & \cdots & \dfrac{\partial F_m}{\partial x_n} \end{bmatrix}.\]
The Jacobian provides the best linear approximation of \(F\) near \(x\):
\[F(x + h) \approx F(x) + J_F(x)\, h.\]
Thus, locally, the nonlinear map \(F\) behaves like the linear map \(h \mapsto J_F(x)h\). A small displacement \(h\) in input space is transformed into an output change governed by the Jacobian.
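A numerical check of this linearization for an illustrative map \(F(x) = (x_1 x_2, \sin x_1)\):

```python
import numpy as np

# F : R^2 -> R^2 with F(x) = (x1 * x2, sin x1) and its Jacobian
F = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
J = lambda x: np.array([[x[1], x[0]],
                        [np.cos(x[0]), 0.0]])

x = np.array([0.5, 2.0])
h = np.array([1e-5, -2e-5])
lin = F(x) + J(x) @ h          # first-order model F(x) + J_F(x) h
# The linearization error is O(||h||^2), far smaller than ||h||
assert np.linalg.norm(F(x + h) - lin) < 1e-9
```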
Interpreting the Jacobian¶
| Component of \(J_F(x)\) | Meaning |
|---|---|
| Row \(i\) | Gradient of output \(F_i(x)\): how the \(i\)-th output changes with each input variable. |
| Column \(j\) | Sensitivity of all outputs to \(x_j\): how varying input \(x_j\) affects the entire output vector. |
| Determinant (when \(m=n\)) | Local volume scaling: how \(F\) expands or compresses space near \(x\). |
| Rank | Local dimension of the image: whether any input directions are lost or collapsed. |
The Jacobian is therefore a compact representation of local sensitivity. In optimization, Jacobians appear in gradient-based methods, backpropagation, implicit differentiation, and the analysis of constraints and dynamics.
The Hessian and Curvature¶
For a twice-differentiable function \(f : \mathbb{R}^n \to \mathbb{R}\), the Hessian matrix collects all second-order partial derivatives:
\[\big( \nabla^2 f(x) \big)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.\]
The Hessian describes the local curvature of the function. While the gradient indicates the direction of steepest change, the Hessian tells us how that directional change itself varies—whether the surface curves upward, curves downward, or remains nearly flat.
Curvature and positive definiteness¶
The eigenvalues of the Hessian determine its geometric behavior:
- If \(\nabla^2 f(x) \succeq 0\) (all eigenvalues nonnegative), the function is locally convex near \(x\).
- If \(\nabla^2 f(x) \succ 0\), the surface curves upward in all directions, guaranteeing local (and for convex functions, global) uniqueness of the minimizer.
- If the Hessian has both positive and negative eigenvalues, the point is a saddle: some directions curve up, others curve down.
Thus, curvature is directly encoded in the spectrum of the Hessian. Large eigenvalues correspond to steep curvature; small eigenvalues correspond to gently sloping or flat regions.
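A minimal sketch classifying critical points by Hessian eigenvalue signs (the example functions are illustrative):

```python
import numpy as np

# f(x) = x1^2 - x2^2 has gradient zero at the origin and Hessian diag(2, -2):
# mixed eigenvalue signs mean a saddle point.
eigs = np.linalg.eigvalsh(np.diag([2.0, -2.0]))
assert eigs.min() < 0 < eigs.max()            # saddle

# f(x) = x1^2 + x2^2: Hessian diag(2, 2) is positive definite, a local minimum
assert np.linalg.eigvalsh(np.diag([2.0, 2.0])).min() > 0
```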
Taylor approximation¶
Taylor expansions provide local approximations of a function using its derivatives. These approximations form the basis of nearly all gradient-based optimization methods.
First-order approximation¶
If \(f\) is differentiable at \(x\), then for small steps \(d\),
\[f(x + d) \approx f(x) + \nabla f(x)^\top d.\]
The gradient gives the best linear model of the function near \(x\). This linear approximation is the foundation of first-order methods such as gradient descent, which choose directions based on how this model predicts the function will change.
Second-order approximation¶
If \(f\) is twice differentiable, we can include curvature information:
\[f(x + d) \approx f(x) + \nabla f(x)^\top d + \tfrac{1}{2}\, d^\top \nabla^2 f(x)\, d.\]
The quadratic term measures how the gradient itself changes with direction. The behavior of this term depends on the Hessian:
- If \(\nabla^2 f(x) \succeq 0\), the quadratic term is nonnegative and the function curves upward, locally bowl-shaped.
- If the Hessian has both positive and negative eigenvalues, the function bends up in some directions and down in others, characteristic of saddle points.
Role in optimization algorithms¶
Second-order Taylor models are the basis of Newton-type methods. Newton’s method chooses \(d\) by approximately minimizing the quadratic model, which balances descent direction and local curvature. Trust-region and quasi-Newton methods also rely on this quadratic approximation, modifying or regularizing it to ensure stable progress.
Thus, Taylor expansions connect a function’s derivatives to practical optimization steps, bridging geometry and algorithm design.
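To make the quadratic-model view concrete, here is a sketch of a pure Newton iteration on an illustrative strictly convex function \(f(x) = e^{x_1 + x_2} + x_1^2 + x_2^2\) (no line search or safeguards, so this is a teaching sketch, not a production method):

```python
import numpy as np

# f(x) = exp(x1 + x2) + x1^2 + x2^2 is strictly convex (Hessian PD everywhere)
def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e + 2 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    # Newton step: the minimizer of the local quadratic model
    d = np.linalg.solve(hess(x), -grad(x))
    x = x + d
assert np.linalg.norm(grad(x)) < 1e-8   # converged to the stationary point
```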
Smoothness and Strong Convexity¶
In optimization, the behavior of a function’s curvature strongly influences how algorithms perform. Two fundamental properties, Lipschitz smoothness and strong convexity, describe how rapidly the gradient can change and how much curvature the function must have.
Lipschitz continuous gradients (L-smoothness)¶
A differentiable function has an \(L\)-Lipschitz continuous gradient if
\[\|\nabla f(x) - \nabla f(y)\|_2 \le L\, \|x - y\|_2 \quad \text{for all } x, y.\]
This condition limits how quickly the gradient can change. Intuitively, an \(L\)-smooth function cannot have sharp bends or extremely steep local curvature. A key consequence is the Descent Lemma:
\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|_2^2.\]
This inequality states that every \(L\)-smooth function is upper-bounded by a quadratic model derived from its gradient. It provides a guaranteed estimate of how much the function can increase when we take a step.
In gradient descent, smoothness directly determines a safe step size: choosing
\[\eta \le \frac{1}{L}\]
ensures that each update decreases the function value for convex objectives. In machine learning, the constant \(L\) effectively controls how large the learning rate can be before training becomes unstable.
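A sketch of this guarantee on an illustrative quadratic, where \(L = \lambda_{\max}(Q)\) and the step \(\eta = 1/L\) yields monotone decrease:

```python
import numpy as np

# f(x) = 1/2 x^T Q x is L-smooth with L = lambda_max(Q) = 10
Q = np.diag([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * x @ Q @ x

x = np.array([5.0, 5.0])
for _ in range(200):
    x_new = x - (1.0 / L) * (Q @ x)   # eta = 1/L
    assert f(x_new) <= f(x)           # guaranteed monotone decrease
    x = x_new
assert f(x) < 1e-6                    # converged toward the minimizer x* = 0
```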
Strong convexity¶
A differentiable function is \(\mu\)-strongly convex if, for some \(\mu > 0\),
\[f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|_2^2 \quad \text{for all } x, y.\]
This condition guarantees that \(f\) has at least \(\mu\) amount of curvature everywhere. Geometrically, the function always lies above its tangent plane by a quadratic bowl, growing at least as fast as a parabola away from its minimizer.
Strong convexity has major optimization implications:
- The minimizer is unique.
- Gradient descent converges linearly with step size \(\eta \le 1/L\).
- The ratio \(L / \mu\) (the condition number) dictates convergence speed.
Curvature in both directions¶
Together, smoothness and strong convexity bound the curvature of \(f\):
\[\mu I \preceq \nabla^2 f(x) \preceq L I.\]
Smoothness prevents the curvature from being too large, while strong convexity prevents it from being too small. Many convergence guarantees in optimization depend on this pair of inequalities.
These concepts, limiting curvature from above via \(L\) and from below via \(\mu\), form the foundation for analyzing the performance of first-order algorithms and understanding how learning rates, conditioning, and geometry interact.
Mental map¶
Multivariable Calculus for Optimization
How objectives change, how curvature shapes algorithms
│
▼
Local change of a scalar function f(x)
│
▼
┌───────────────────────────────────────────────────────────┐
│ Differentiability & First-Order Model │
│ f(x+h) = f(x) + ∇f(x)ᵀh + o(‖h‖) │
│ - ∇f(x): best linear approximation │
│ - Directional derivative: D_v f(x) = ∇f(x)ᵀv │
│ - Steepest descent: move along -∇f(x) │
└───────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Geometry of Level Sets │
│ L_c = {x : f(x)=c} │
│ - If ∇f(x) ≠ 0, then ∇f(x) ⟂ level set at x │
│ - Connects to constrained optimality (later: KKT) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Vector-Valued Maps & Jacobians │
│ F: ℝⁿ → ℝᵐ │
│ - Jacobian J_F(x) stacks gradients of outputs │
│ - Linearization: F(x+h) ≈ F(x) + J_F(x) h │
│ - Chain rule foundation for backprop / sensitivity analysis │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Second-Order Structure: Hessian & Curvature │
│ ∇²f(x): matrix of second partials │
│ - Curvature along v: vᵀ∇²f(x)v │
│ - Eigenvalues quantify steep/flat directions │
│ - PSD/PD Hessian ties directly to convexity (Ch.5) │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Taylor Models → Algorithm Design │
│ First-order: f(x+d) ≈ f(x) + ∇f(x)ᵀd │
│ Second-order: f(x+d) ≈ f(x) + ∇f(x)ᵀd + ½ dᵀ∇²f(x)d │
│ - Gradient descent uses the linear model │
│ - Newton uses the quadratic model: d ≈ -(∇²f)^{-1}∇f │
│ - Trust-region / quasi-Newton approximate curvature │
└───────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Global Control of Local Behavior: Smoothness & Strong Convexity│
│ L-smooth: ‖∇f(x)-∇f(y)‖ ≤ L‖x-y‖ │
│ - Descent Lemma gives a quadratic upper bound │
│ - Sets safe step size: η ≤ 1/L (for convex objectives) │
│ μ-strongly convex: f lies above tangents by (μ/2)‖y-x‖² │
│ - Unique minimizer, linear convergence of gradient descent │
│ Combined curvature bounds: μI ⪯ ∇²f(x) ⪯ LI │
│ - Condition number κ = L/μ governs difficulty │
└────────────────────────────────────────────────────────────────┘
Chapter 4: Convex Sets and Geometric Fundamentals¶
Most optimization problems are constrained. The set of points that satisfy these constraints, the feasible region, determines where an algorithm is allowed to search. In many machine learning and convex optimization problems, this feasible region is a convex set. Convex sets have a simple but powerful geometric property: any line segment between two feasible points remains entirely within the set. This structure eliminates irregularities and makes optimization far more predictable.
Convex sets¶
A set \(C \subseteq \mathbb{R}^n\) is convex if for any two points \(x, y \in C\) and any \(\theta \in [0,1]\),
\[\theta x + (1 - \theta) y \in C.\]
That is, the entire line segment between \(x\) and \(y\) lies inside the set. Convex sets have no “holes” or “indentations,” and this geometric regularity is what makes optimization over them tractable.
Examples¶
- Affine subspaces: \(\{ x : Ax = b \}\).
- Halfspaces: \(\{ x : a^\top x \le b \}\).
- Euclidean balls: \(\{ x : \|x - x_0\|_2 \le r \}\).
- \(\ell_\infty\) balls (axis-aligned boxes): \(\{ x : \|x - x_0\|_\infty \le r \}\).
- Probability simplex: \(\{ x : x \ge 0, \ \mathbf{1}^\top x = 1 \}\).
A set fails to be convex whenever some segment between two feasible points leaves the set; for example, a crescent or an annulus.
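A randomized sketch of this test (sampling parameters are arbitrary; a found counterexample proves non-convexity, while passing is only evidence):

```python
import numpy as np

rng = np.random.default_rng(0)

def chords_stay_inside(member, n_trials=2000):
    """Sample random convex combinations of member points; one escaping
    point proves non-convexity (passing is only evidence, not a proof)."""
    pts = np.array([p for p in rng.uniform(-2, 2, size=(20000, 2)) if member(p)])
    for _ in range(n_trials):
        x, y = pts[rng.integers(len(pts), size=2)]
        theta = rng.uniform()
        if not member(theta * x + (1 - theta) * y):
            return False
    return True

ball = lambda p: np.linalg.norm(p) <= 1.0              # convex
annulus = lambda p: 0.5 <= np.linalg.norm(p) <= 1.0    # has a hole: not convex

ball_convex = chords_stay_inside(ball)
annulus_convex = chords_stay_inside(annulus)
assert ball_convex and not annulus_convex
```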
Affine sets, hyperplanes, and halfspaces¶
Affine sets generalize linear subspaces by allowing a shift. A set \(A\) is affine if for some point \(x_0\) and subspace \(S\),
\[A = x_0 + S = \{ x_0 + s : s \in S \}.\]
Affine sets are always convex, since adding a fixed offset does not affect the convexity of the underlying subspace.
A hyperplane is an affine set defined by a single linear equation:
\[\{ x : a^\top x = b \}, \qquad a \ne 0.\]
Hyperplanes act as the “flat boundaries” of higher-dimensional space and are the fundamental building blocks of polyhedra.
A halfspace is one side of a hyperplane:
\[\{ x : a^\top x \le b \}.\]
Halfspaces are convex and serve as basic local approximations to general convex sets.
Convex combinations and convex hulls¶
A convex combination of points \(x_1, \dots, x_k\) is a weighted average
\[\sum_{i=1}^{k} \theta_i x_i, \qquad \theta_i \ge 0, \quad \sum_{i=1}^{k} \theta_i = 1.\]
Convex sets are precisely those that contain all convex combinations of their points.
The convex hull of a set \(S\), denoted \(\operatorname{conv}(S)\), is the set of all convex combinations of finitely many points in \(S\). It is the smallest convex set containing \(S\). Geometrically, it is the shape you obtain by stretching a tight rubber band around the points.
Polyhedra and polytopes¶
A polyhedron is an intersection of finitely many halfspaces:
\[P = \{ x : Ax \le b \}.\]
Polyhedra are always convex; they may be bounded or unbounded.
If a polyhedron is also bounded, it is called a polytope. Polytopes include familiar shapes such as cubes, simplices, and more general polytopes that arise as feasible regions in linear programs.
Extreme points¶
Let \(C \subseteq \mathbb{R}^n\) be a convex set. A point \(x \in C\) is an extreme point if it cannot be written as a nontrivial convex combination of other points in the set. Formally, if \(x = \theta y + (1 - \theta) z\) with \(y, z \in C\) and \(\theta \in (0,1)\), then \(y = z = x\).
Geometrically, extreme points are the “corners” of a convex set. For polytopes, the extreme points are exactly the vertices. Extreme points are essential in optimization because many convex problems, such as linear programs, achieve their optima at extreme points of the feasible region. This geometric fact underlies simplex-type algorithms and supports duality theory.
Cones¶
Cones generalize the idea of “directions” in geometry. They capture sets that are closed under nonnegative scaling and play a central role in convex analysis and constrained optimization.
Basic definition¶
A set \(K \subseteq \mathbb{R}^n\) is a cone if
\[x \in K, \ \alpha \ge 0 \ \Longrightarrow \ \alpha x \in K.\]
A cone is convex if it is also closed under addition:
\[x, y \in K \ \Longrightarrow \ x + y \in K.\]
Cones are not required to contain negative multiples of a vector, so they are generally not subspaces. Instead of extreme points, cones have extreme rays, which represent directions that cannot be formed as positive combinations of other rays. For example, in the nonnegative orthant \(\mathbb{R}^n_+\), each coordinate axis direction is an extreme ray.
Conic hull¶
Given any set \(S\), its conic hull is the set of all conic combinations:
\[\operatorname{cone}(S) = \Big\{ \sum_i \alpha_i x_i : x_i \in S, \ \alpha_i \ge 0 \Big\}.\]
This is the smallest convex cone containing \(S\). Conic hulls appear frequently in duality theory and in convex relaxations for optimization.
Polar cones¶
For a cone \(K\), the polar cone is defined as
\[K^\circ = \{ v : v^\top x \le 0 \ \text{for all } x \in K \}.\]
Intuition:
- Polar vectors make a nonacute angle with every vector in \(K\).
Key properties:
- \(K^\circ\) is always a closed convex cone.
- If \(K\) is a subspace, then \(K^\circ\) is the orthogonal complement.
- For any closed convex cone, \((K^\circ)^\circ = K\).
Polar cones provide the geometric foundation for normal cones, dual cones, and many optimality conditions.
Tangent cones¶
For a set \(C\) and a point \(x \in C\), the tangent cone \(T_C(x)\) consists of all feasible “infinitesimal directions” from \(x\); for convex \(C\),
\[T_C(x) = \operatorname{cl}\{ \alpha (y - x) : y \in C, \ \alpha \ge 0 \}.\]
Intuition:
- At an interior point, \(T_C(x) = \mathbb{R}^n\): all small moves are allowed.
- At a boundary point, some directions are blocked; only directions that stay inside the set are feasible.
Tangent cones describe feasible directions for methods such as projected gradient descent or interior-point algorithms.
Normal cones¶
For a convex set \(C\), the normal cone at a point \(x \in C\) is
\[N_C(x) = \{ v : v^\top (y - x) \le 0 \ \text{for all } y \in C \}.\]
Interpretation:
- Every \(v \in N_C(x)\) defines a supporting hyperplane to \(C\) at \(x\).
- At interior points, the normal cone is \(\{0\}\).
- At boundary or corner points, it becomes a pointed cone of outward normals.
A fundamental relationship ties tangent and normal cones together:
\[N_C(x) = \big( T_C(x) \big)^\circ.\]
Normal cones appear directly in first-order optimality conditions. For a constrained problem
\[\min_{x \in C} f(x), \]a point \(x^*\) is optimal only if
\[0 \in \nabla f(x^*) + N_C(x^*).\]This expresses a balance between the objective’s slope and the “pushback’’ from the constraint set.
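A small numerical sketch of this optimality condition for projection onto a box (the point \(z\) and the box \([0,1]^2\) are illustrative):

```python
import numpy as np

# min f(x) = 1/2 ||x - z||^2 over the box C = [0,1]^2, with z outside C.
# The solution is the projection of z onto C, and -grad f(x*) = z - x*
# must lie in the normal cone N_C(x*).
z = np.array([2.0, -0.5])
x_star = np.clip(z, 0.0, 1.0)     # projection onto the box
g = x_star - z                    # gradient of f at x*

# Normal-cone condition: (-g)^T (y - x*) <= 0 for every y in C
rng = np.random.default_rng(0)
Y = rng.uniform(0.0, 1.0, size=(1000, 2))   # samples from C
assert np.all((Y - x_star) @ (-g) <= 1e-12)
```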
Supporting Hyperplanes and Separation¶
One of the most important geometric facts about convex sets is that they can be supported or separated by hyperplanes. These results show that convex sets always admit linear boundaries that describe their shape.
Supporting Hyperplane Theorem¶
Let \(C \subseteq \mathbb{R}^n\) be nonempty, closed, and convex, and let \(x_0\) be a boundary point of \(C\). Then there exists a nonzero vector \(a\) such that
\[a^\top x \le a^\top x_0 \quad \text{for all } x \in C.\]
This means that the hyperplane
\[\{ x : a^\top x = a^\top x_0 \}\]
touches \(C\) at \(x_0\) but does not cut through it. The vector \(a\) is normal to the hyperplane. Intuitively, a supporting hyperplane is like a flat board pressed against the edge of a convex object. Supporting hyperplanes will later correspond exactly to subgradients of convex functions.
Separating Hyperplane Theorem¶
If \(C\) and \(D\) are nonempty, disjoint convex sets, then a hyperplane exists that separates them. That is, there are a nonzero vector \(a\) and scalar \(b\) such that
\[a^\top x \le b \le a^\top y \quad \text{for all } x \in C, \ y \in D.\]
The hyperplane \(a^\top x = b\) places all points of \(C\) on one side and all points of \(D\) on the other. This is guaranteed purely by convexity. Separation is the geometric foundation of duality, where we attempt to separate the primal feasible region from violations of the constraints.
Mental Map¶
Convex Sets & Geometric Fundamentals
Feasible regions, geometry of constraints, and separation
│
▼
Core idea: convexity removes "bad geometry"
(segments stay inside → no holes/indentations → tractable)
│
▼
┌───────────────────────────────────────────────────────────┐
│ Definition of Convex Set │
│ C convex ⇔ θx + (1-θ)y ∈ C for all x,y∈C, θ∈[0,1] │
│ - Geometry: every chord lies inside │
│ - Optimization: feasible region supports global reasoning │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Affine Geometry: the "flat" building blocks │
│ - Affine set: x0 + S │
│ - Hyperplane: {x : aᵀx = b} │
│ - Halfspace: {x : aᵀx ≤ b} │
│ Role: linear constraints and local linear boundaries │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Convex Combinations & Convex Hull │
│ - Convex combination: Σ θ_i x_i, θ_i≥0, Σθ_i=1 │
│ - conv(S): all convex combos of points in S │
│ Why it matters: convexification / relaxations / geometry │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Polyhedra & Polytopes │
│ - Polyhedron: intersection of finitely many halfspaces │
│ P = {x : Ax ≤ b} │
│ - Polytope: bounded polyhedron │
│ Why it matters: LP feasible sets; two views (H- vs V-form)│
└───────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Extreme Points (Corners) │
│ - x extreme ⇔ cannot be written as nontrivial convex combo │
│ - For polytopes: extremes = vertices │
│ Optimization link: linear objectives attain optima at corners│
└──────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Cones: scaling geometry for constraints & duality │
│ - Cone: x∈K, α≥0 ⇒ αx∈K │
│ - Convex cone: also closed under addition │
│ - Conic hull cone(S): smallest convex cone containing S │
│ - Extreme rays replace extreme points │
└───────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Local Directional Geometry at a Point x │
│ Tangent cone T_C(x): feasible infinitesimal directions │
│ - Interior point: T_C(x)=ℝⁿ │
│ - Boundary: directions restricted │
│ Normal cone N_C(x): outward normals / supporting directions │
│ - Interior point: N_C(x)={0} │
│ - Boundary/corner: pointed cone of normals │
│ Duality relation: N_C(x) = (T_C(x))° │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Supporting Hyperplanes & Separation │
│ Supporting hyperplane at boundary point x0: │
│ ∃a≠0 s.t. aᵀx ≤ aᵀx0 for all x∈C │
│ Separating hyperplane for disjoint convex sets C,D: │
│ ∃a,b s.t. aᵀx ≤ b ≤ aᵀy for x∈C, y∈D │
│ Why it matters: geometry behind subgradients and duality │
└───────────────────────────────────────────────────────────┘
Chapter 5: Convex Functions¶
This chapter develops the basic tools for understanding convex functions: their definitions, geometric characterisations, first and second-order tests, and operations that preserve convexity. These tools will later support duality, optimality conditions, and algorithmic analysis.
Definitions of convexity¶
A function \(f : \mathbb{R}^n \to \mathbb{R}\) is convex if for all \(x,y\) in its domain and all \(\theta \in [0,1]\),
\[f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y).\]
The graph of \(f\) never dips below the straight line between \((x,f(x))\) and \((y,f(y))\). If the inequality is strict whenever \(x \neq y\), the function is strictly convex.
A powerful geometric viewpoint comes from the epigraph:
\[\operatorname{epi}(f) = \{ (x, t) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le t \}.\]
The function \(f\) is convex if and only if its epigraph is a convex set. This connects convex functions to the convex sets studied earlier.
Figure: a function is convex if and only if the region above its graph, its epigraph, is a convex set. (Source: Wikipedia.)
First-order characterisation¶
If \(f\) is differentiable, then \(f\) is convex if and only if
\[f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y.\]
Interpretation:
- The tangent plane at any point \(x\) lies below the function everywhere.
- \(\nabla f(x)\) defines a supporting hyperplane to the epigraph.
- The gradient provides a global linear underestimator of \(f\).
This geometric picture is crucial in optimisation:
at a minimiser \(x^\star\), convexity implies
\[\nabla f(x^\star) = 0 \ \Longrightarrow \ f(y) \ge f(x^\star) \ \text{for all } y.\]
For nondifferentiable convex functions, the gradient is replaced by a subgradient, which plays the same role in forming supporting hyperplanes.
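A quick check of the global-underestimator property for the convex softplus function \(\log(1 + e^x)\) (an illustrative one-dimensional choice):

```python
import numpy as np

# Softplus f(x) = log(1 + e^x) is convex; its tangent lines never overshoot.
f = lambda x: np.log1p(np.exp(x))
df = lambda x: 1.0 / (1.0 + np.exp(-x))   # derivative (the sigmoid)

xs = np.linspace(-5, 5, 201)
for x0 in (-2.0, 0.0, 3.0):
    tangent = f(x0) + df(x0) * (xs - x0)
    assert np.all(f(xs) >= tangent - 1e-12)   # global underestimator
```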
Second-order characterisation¶
If \(f\) is twice continuously differentiable, then convexity can be checked via curvature:
\[f \ \text{convex} \iff \nabla^2 f(x) \succeq 0 \ \text{for all } x.\]
- If the Hessian is positive semidefinite everywhere, the function bends upward.
- If \(\nabla^2 f(x) \succ 0\) everywhere, the function is strictly convex.
- Negative eigenvalues indicate directions of negative curvature — impossible for convex functions.
Examples of convex functions¶
- Affine functions: \(f(x) = a^\top x + b\). Always convex (and concave). They define supporting hyperplanes.
- Quadratic functions with PSD Hessian: \(f(x) = \tfrac{1}{2} x^\top Q x + c^\top x + d\) with \(Q \succeq 0\). Convex because the curvature matrix \(Q\) is PSD.
- Norms: \(f(x) = \|x\|\). All norms are convex; in ML, norms induce regularisers (Lasso, ridge).
- Maximum of affine functions: \(f(x) = \max_i \, (a_i^\top x + b_i)\). Convex because the maximum of convex functions is convex. (Important in SVM hinge loss.)
- Log-sum-exp: \(f(x) = \log \sum_i e^{x_i}\). A smooth approximation to the max; convex (provable via Hölder’s inequality). Appears in softmax, logistic regression, partition functions.
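Log-sum-exp is also a good occasion to show the standard max-shift trick used to evaluate it stably; a minimal sketch (inputs chosen so the naive formula would overflow):

```python
import numpy as np

# log sum_i exp(x_i) = m + log sum_i exp(x_i - m), with m = max_i x_i.
# Shifting by the max keeps every exponent <= 0, avoiding overflow.
def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(x) would overflow
val = logsumexp(x)
assert np.isfinite(val)
# Smooth-max property: max(x) <= logsumexp(x) <= max(x) + log(n)
assert x.max() <= val <= x.max() + np.log(len(x))
```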
Jensen’s inequality¶
Let \(f\) be convex and \(X\) a random variable in its domain. Then:
\[f\big( \mathbb{E}[X] \big) \le \mathbb{E}\big[ f(X) \big].\]
This generalises the definition of convexity from finite averages to expectations.
Practically:
- convex functions “pull upward” under averaging,
- the bound \(\mathbb{E}[e^X] \ge e^{\mathbb{E}[X]}\) is Jensen applied to the convex exponential,
- EM and variational methods rely on Jensen to construct lower bounds.
As a finite form, for \(\theta_i \ge 0\) with \(\sum_i \theta_i = 1\),
\[f\Big( \sum_i \theta_i x_i \Big) \le \sum_i \theta_i f(x_i).\]
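Both forms are easy to verify numerically; a small sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)

# Jensen for the convex f(t) = t^2: f(E[X]) <= E[f(X)]
assert X.mean()**2 <= (X**2).mean()

# Finite form with weights theta summing to one
theta = np.array([0.2, 0.3, 0.5])
pts = np.array([-1.0, 0.5, 2.0])
f = lambda t: t**2
assert f(theta @ pts) <= theta @ f(pts)
```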
Operations that preserve convexity¶
Convexity is preserved under many natural constructions:
- Nonnegative scaling: if \(f\) is convex and \(\alpha \ge 0\), then \(\alpha f\) is convex.
- Addition: if \(f\) and \(g\) are convex, then \(f+g\) is convex.
- Maximum: \(\max\{f,g\}\) is convex.
- Affine pre-composition: if \(A\) is a matrix and \(b\) a vector, then \(x \mapsto f(Ax + b)\) is convex.
- Monotone composition rule: if \(f\) is convex and nondecreasing in each argument, and each \(g_i\) is convex, then \(x \mapsto f(g_1(x), \dots, g_k(x))\) is convex.
Level sets of convex functions¶
For \(\alpha \in \mathbb{R}\), the sublevel set is
\[S_\alpha = \{ x : f(x) \le \alpha \}.\]
If \(f\) is convex, every sublevel set is convex.
This property is crucial because inequalities \(f(x) \le \alpha\) are ubiquitous in constraints.
Examples:
- Norm balls: \(\{ x : \|x\|_2 \le r \}\)
- Linear regression confidence ellipsoids: \(\{ x : \|Ax - b\|_2 \le \epsilon \}\)
These sets enable convex constrained optimisation formulations.
Mental Map¶
Convex Functions
Objective landscapes with predictable geometry and guarantees
│
▼
Core idea: no bad local minima
(every local minimum is global; geometry is well-behaved)
│
▼
┌───────────────────────────────────────────────────────────┐
│ Definition of Convexity │
│ f(θx+(1−θ)y) ≤ θf(x)+(1−θ)f(y) │
│ - Graph lies below all chords │
│ - Strict convexity: inequality is strict │
│ - Epigraph view: f convex ⇔ epi(f) is a convex set │
└───────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ First-Order Geometry (Supporting Hyperplanes) │
│ f(y) ≥ f(x)+∇f(x)ᵀ(y−x) │
│ - Tangent plane globally underestimates f │
│ - ∇f(x) defines a supporting hyperplane to epi(f) │
│ - Optimality: ∇f(x*)=0 ⇔ x* global minimizer (smooth case)│
│ - Nonsmooth extension: subgradients (next chapter) │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Second-Order Characterisation │
│ ∇²f(x) ⪰ 0 for all x │
│ - PSD Hessian ⇔ upward curvature everywhere │
│ - PD Hessian ⇔ strict convexity │
│ - Links convexity to eigenvalues and curvature (Ch.3) │
└────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Canonical Examples │
│ - Affine functions: supporting hyperplanes │
│ - Quadratics (Q⪰0): curvature from Hessian │
│ - Norms: regularization geometry │
│ - Max of affine functions: hinge loss, LPs │
│ - Log-sum-exp: smooth max, softmax, logistic regression │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Jensen’s Inequality │
│ f(E[X]) ≤ E[f(X)] │
│ - Convex functions penalize variability │
│ - Basis for EM, variational bounds, log-sum-exp convexity │
│ - Extends convexity from points to expectations │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Convexity-Preserving Operations │
│ - Scaling (α≥0), addition │
│ - Max of convex functions │
│ - Affine pre-composition f(Ax+b) │
│ - Monotone composition rules │
│ Role: modular construction of convex models │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Sublevel Sets │
│ {x : f(x) ≤ α} │
│ - Always convex for convex f │
│ - Enables convex inequality constraints │
│ - Norm balls, confidence ellipsoids, feasibility regions │
└─────────────────────────────────────────────────────────────┘
Chapter 6: Nonsmooth Convex Optimization – Subgradients¶
Many important convex objectives in machine learning are not differentiable everywhere. Examples include:
- the \(\ell_1\) norm \(\|x\|_1\) (nondifferentiable at zero),
- pointwise-max functions such as \(f(x) = \max_i (a_i^\top x + b_i)\),
- the hinge loss used in SVMs,
- regularisers like total variation or indicator functions of convex sets.
Although these functions have “kinks”, they remain convex and convexity guarantees the existence of supporting hyperplanes at every point.
Subgradients and the Subdifferential¶
Let \(f : \mathbb{R}^n \to \mathbb{R}\) be convex. A vector \(g \in \mathbb{R}^n\) is a subgradient of \(f\) at \(x\) if
\[ f(y) \ge f(x) + g^\top (y - x) \quad \text{for all } y. \]
Geometric interpretation:
- The affine function \(y \mapsto f(x) + g^\top(y-x)\) is a global underestimator of \(f\).
- Each subgradient defines a supporting hyperplane touching the epigraph of \(f\) at \((x, f(x))\).
- At smooth points, this supporting hyperplane is unique (the tangent plane).
- At kinks, there may be infinitely many supporting hyperplanes.
The subdifferential of \(f\) at \(x\) is the set
\[ \partial f(x) = \{\, g \in \mathbb{R}^n : f(y) \ge f(x) + g^\top (y-x) \ \text{for all } y \,\}. \]
Properties:
- \(\partial f(x)\) is always a nonempty convex set (if \(x\) is in the interior of the domain).
- If \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\).
- The subdifferential is a singleton exactly at points where \(f\) is differentiable; at kinks it contains more than one element.
Thus, subgradients generalise gradients to nonsmooth convex functions, preserving the same geometric meaning.
Examples¶
Absolute value in 1D¶
Let \(f(t) = |t|\).
Then:
- If \(t > 0\), \(\partial f(t) = \{1\}\).
- If \(t < 0\), \(\partial f(t) = \{-1\}\).
- If \(t = 0\), \(\partial f(0) = [-1, 1]\).
At the kink, any slope between \(-1\) and \(1\) supports the graph from below.
The \(\ell_1\) norm¶
For \(f(x) = \|x\|_1 = \sum_i |x_i|\), the subdifferential decomposes coordinate-wise:
\[ \partial f(x) = \{\, g : g_i = \operatorname{sign}(x_i) \ \text{if } x_i \neq 0, \ \ g_i \in [-1,1] \ \text{if } x_i = 0 \,\}. \]
Thus:
- if \(x_i > 0\), then \(g_i = 1\),
- if \(x_i < 0\), then \(g_i = -1\),
- if \(x_i = 0\), then \(g_i \in [-1,1]\).
This structure appears directly in LASSO and compressed sensing optimality conditions.
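This coordinate-wise structure is easy to compute. A sketch (our own helper, using NumPy) returns one valid subgradient, choosing \(g_i = 0 \in [-1,1]\) at kinks, and verifies the subgradient inequality on random points:

```python
import numpy as np

def l1_subgradient(x):
    # one valid subgradient of ||x||_1:
    # sign(x_i) where x_i != 0, and 0 (any value in [-1,1] works) at kinks
    return np.sign(x)

x = np.array([1.5, -2.0, 0.0, 0.3])
g = l1_subgradient(x)
fx = np.abs(x).sum()

rng = np.random.default_rng(2)
for _ in range(1000):
    y = rng.normal(size=x.size)
    # subgradient inequality: f(y) >= f(x) + g^T (y - x)
    assert np.abs(y).sum() >= fx + g @ (y - x) - 1e-9
```

Replacing the `0.0` at the kink with any value in \([-1, 1]\) would also pass the check; the subdifferential at a kink is a whole interval, not a point.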
Pointwise maximum of affine functions¶
Let
\[ f(x) = \max_{i=1,\dots,m} \ (a_i^\top x + b_i). \]
- If only one index \(i^\star\) achieves the maximum at \(x\), then \(\partial f(x) = \{ a_{i^\star} \}\).
- If multiple indices are tied, then
\[ \partial f(x) = \operatorname{conv}\{\, a_i : a_i^\top x + b_i = f(x) \,\}, \]
the convex hull of the active slopes.
This structure underlies SVM hinge loss and ReLU-type functions.
Subgradient Optimality Condition¶
For the unconstrained convex minimisation problem
\[ \min_x f(x), \]
a point \(x^\star\) is optimal if and only if
\[ 0 \in \partial f(x^\star). \]
Interpretation:
- At optimality, no subgradient points to a direction that would decrease \(f\).
- Geometrically, the supporting hyperplane at \(x^\star\) is horizontal, forming the flat bottom of the convex bowl.
- This generalises the smooth condition \(\nabla f(x^\star) = 0\).
Subgradient Calculus¶
Subgradients satisfy powerful calculus rules that allow us to work with complex functions. Let \(f\) and \(g\) be convex.
Sum rule¶
\[ \partial (f+g)(x) \supseteq \partial f(x) + \partial g(x). \]
Equality holds under mild regularity conditions (e.g., if the relative interiors of the domains of \(f\) and \(g\) intersect).
Affine composition¶
If \(h(x) = f(Ax + b)\), then
\[ \partial h(x) = A^\top \, \partial f(Ax + b). \]
This rule is heavily used in machine learning models, where losses depend on linear predictions \(Ax\).
Maximum of convex functions¶
If \(f(x) = \max_i f_i(x)\), then
\[ \partial f(x) = \operatorname{conv} \bigcup \{\, \partial f_i(x) : f_i(x) = f(x) \,\}. \]
This supports models based on hinge losses, margin-maximisation, and piecewise-linear architectures.
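Combining the sum and affine-composition rules gives a working subgradient method. A minimal sketch (our own construction; the problem data are synthetic) for \(\ell_1\)-regularized least squares with diminishing step sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 10))
x_true = np.zeros(10); x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true
lam = 0.1

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

def subgrad(x):
    # sum rule + affine composition: A^T(Ax - b) is the smooth part's gradient;
    # lam * sign(x) is one valid subgradient of the l1 term
    return A.T @ (A @ x - b) + lam * np.sign(x)

x = np.zeros(10)
best = objective(x)
for k in range(1, 3001):
    x = x - (0.01 / np.sqrt(k)) * subgrad(x)   # diminishing step size
    best = min(best, objective(x))

assert best < objective(np.zeros(10))   # the best iterate improved on the start
```

Unlike gradient descent, the subgradient method is not a descent method at every step, which is why we track the best iterate rather than the last one.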
Chapter 7: First-Order and Geometric Optimality Conditions¶
Optimization problems seek points where no infinitesimal movement can improve the objective. For convex functions, first-order conditions give precise geometric and analytic criteria for such points to be optimal. They extend the familiar “zero gradient” condition to nonsmooth and constrained settings, linking gradients, subgradients, and the geometry of feasible regions.
These conditions form the conceptual bridge between unconstrained minimization and the Karush–Kuhn–Tucker (KKT) framework developed in the next chapter.
Orders of Optimality: Why First Order is Enough in Convex Optimization¶
For a differentiable function \(f : \mathbb{R}^n \to \mathbb{R}\), the “order’’ of an optimality condition refers to how many derivatives (or generalized derivatives) we examine around a candidate minimizer \(x^\star\):
| Order | Object inspected | Role |
|---|---|---|
| First-order | \(\nabla f(x^\star)\) or subgradients | Detects existence of a local descent direction |
| Second-order | Hessian \(\nabla^2 f(x^\star)\) | Examines curvature (minimum vs saddle vs maximum) |
| Higher-order | Third derivative and beyond | Rarely used; only for degenerate cases with vanishing curvature |
In general nonconvex optimization, these conditions are used together: a point may have \(\nabla f(x^\star) = 0\) but still be a saddle or a local maximum, so curvature (second order) must also be checked.
For convex functions, the situation is much simpler. A convex function already has non-negative curvature everywhere (in the smooth case):
\[ \nabla^2 f(x) \succeq 0 \quad \text{for all } x. \]
Therefore:
- any stationary point (where the first-order condition holds) cannot be a local maximum or saddle,
- if the function is proper and lower semicontinuous, first-order conditions are enough to guarantee global optimality.
As a result, in convex optimization we typically rely only on first-order conditions, possibly expressed in terms of subgradients and geometric objects (normal cones, tangent cones). This collapse of the hierarchy is one of the key simplifications that makes convex analysis powerful.
Motivation¶
Consider the basic convex problem
\[ \min_{x \in \mathcal{X}} f(x), \]
where \(f\) is convex and \(\mathcal{X}\) is a convex set.
Intuitively, a point \(\hat{x}\) is optimal if there is no feasible direction in which we can move and strictly decrease \(f\). In the unconstrained case, every direction is feasible. In the constrained case, only directions that stay inside \(\mathcal{X}\) are allowed.
Thus, optimality can be seen as an equilibrium:
- the objective’s tendency to decrease (captured by its gradient or subgradient)
- is exactly balanced by the geometric restrictions imposed by the feasible set.
Unconstrained Convex Problems¶
For the unconstrained problem
\[ \min_x f(x) \]
with \(f\) convex, the optimality conditions are especially simple.
Smooth case¶
If \(f\) is differentiable, then a point \(\hat{x}\) is optimal if and only if
\[ \nabla f(\hat{x}) = 0. \]
Convexity ensures that any point where the gradient vanishes is a global minimizer, not just a local one.
Nonsmooth case¶
If \(f\) is convex but not necessarily differentiable, the gradient is replaced by the subdifferential. The condition becomes
\[ 0 \in \partial f(\hat{x}). \]
Interpretation:
- The origin lies in the set of all subgradients at \(\hat{x}\).
- Geometrically, there exists a horizontal supporting hyperplane to the epigraph of \(f\) at \((\hat{x}, f(\hat{x}))\).
- No direction in \(\mathbb{R}^n\) gives a first-order improvement in the objective.
For smooth \(f\), this reduces to the usual condition \(\nabla f(\hat{x}) = 0\).
Constrained Convex Problems¶
Now consider the constrained problem
\[ \min_{x \in \mathcal{X}} f(x), \]
where \(f\) is convex and \(\mathcal{X} \subseteq \mathbb{R}^n\) is a nonempty closed convex set.
If \(\hat{x}\) lies strictly inside \(\mathcal{X}\), then there is locally no distinction from the unconstrained case: all nearby directions are feasible. In that case, \(0 \in \partial f(\hat{x})\) remains the necessary and sufficient condition for optimality.
The interesting case is when \(\hat{x}\) lies on the boundary of \(\mathcal{X}\).
First-order condition with constraints¶
The general first-order optimality condition for the constrained convex problem is:
\[ 0 \in \partial f(\hat{x}) + N_{\mathcal{X}}(\hat{x}). \]
That is, there exist
- a subgradient \(g \in \partial f(\hat{x})\), and
- a normal vector \(v \in N_{\mathcal{X}}(\hat{x})\)
such that \(g + v = 0\).
Interpretation:
- The objective’s slope \(g\) is exactly balanced by a normal vector \(v\) coming from the constraint set.
- If we decompose space into feasible and infeasible directions, there is no feasible direction along which \(f\) can decrease.
- Geometrically, the epigraph of \(f\) and the feasible set meet with aligned supporting hyperplanes at \(\hat{x}\).
Special cases:
- If \(\hat{x}\) is an interior point, then \(N_{\mathcal{X}}(\hat{x}) = \{0\}\), so we recover the unconstrained condition \(0 \in \partial f(\hat{x})\).
- If \(\mathcal{X}\) is an affine set, the normal cone is the orthogonal complement of its tangent subspace, and the condition aligns with equality-constrained optimality.
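The boundary case can be checked numerically. A minimal sketch (our own construction, using NumPy) minimizes a quadratic over the box \([0,1]^n\), where the projection gives the exact minimizer, and verifies the condition \(g + v = 0\) coordinate by coordinate:

```python
import numpy as np

# minimize f(x) = 0.5 * ||x - c||^2 over the box [0, 1]^n
c = np.array([1.7, -0.4, 0.5])
x_hat = np.clip(c, 0.0, 1.0)   # projection onto the box = exact minimizer here
g = x_hat - c                  # gradient of f at x_hat

# normal-cone check for the box, coordinate by coordinate:
#   interior coordinate      -> v_i = 0, so g_i must vanish
#   at the upper bound (x=1) -> v_i >= 0, so g_i = -v_i <= 0
#   at the lower bound (x=0) -> v_i <= 0, so g_i = -v_i >= 0
for xi, gi in zip(x_hat, g):
    if 0.0 < xi < 1.0:
        assert abs(gi) < 1e-12
    elif xi == 1.0:
        assert gi <= 1e-12
    else:  # xi == 0.0
        assert gi >= -1e-12
```

The gradient is nonzero at the clipped coordinates, yet the point is optimal: the constraint normals absorb exactly the components of \(-\nabla f\) that point out of the feasible set.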
Chapter 8: Lagrange Multipliers and the KKT Framework¶
We now have the ingredients for understanding optimality in convex optimization:
- convex functions define well-behaved objectives,
- convex sets describe feasible regions,
- gradients and subgradients encode descent directions.
This chapter unifies these ideas. We begin with unconstrained minimization and then incorporate equality and inequality constraints. The resulting system of conditions—the Karush–Kuhn–Tucker (KKT) conditions—is the central optimality framework for constrained convex optimization.
In constrained problems, the gradient of the objective cannot vanish freely. Instead, it must be balanced by “forces’’ coming from the constraints. Lagrange multipliers measure these forces, and the KKT conditions express this balance algebraically and geometrically.
Unconstrained Convex Minimization¶
Consider the problem
\[ \min_x f(x), \]
where \(f\) is convex and differentiable.
Gradient descent iteratively updates
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k) \]
with step size \(\alpha_k > 0\).
Intuition:
- Moving opposite the gradient decreases \(f\).
- If the gradient is Lipschitz continuous and the step size is small enough (\(\alpha_k \le 1/L\)), then gradient descent converges to a global minimizer.
- If \(f\) is strongly convex, the minimizer is unique and convergence is faster (linear rate with an appropriate step size).
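As an illustration (our own sketch, not from the text), gradient descent with the constant step \(1/L\) on a small strongly convex quadratic converges to the unique minimizer:

```python
import numpy as np

# minimize f(x) = 0.5 * x^T Q x - p^T x  (convex: Q is positive definite)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(Q).max()   # Lipschitz constant of the gradient

x = np.zeros(2)
for _ in range(500):
    grad = Q @ x - p
    x = x - (1.0 / L) * grad      # constant step size 1/L

x_star = np.linalg.solve(Q, p)    # exact minimizer, for comparison
assert np.allclose(x, x_star, atol=1e-6)
```

Strong convexity (the smallest eigenvalue of \(Q\) is positive) is what makes the convergence linear here rather than merely \(O(1/k)\).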
Equality-Constrained Problems and Lagrange Multipliers¶
Now consider minimizing \(f\) subject to equality constraints:
\[ \min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \quad j = 1, \dots, p. \]
Define the Lagrangian
\[ \mathcal{L}(x, \lambda) = f(x) + \sum_{j=1}^{p} \lambda_j h_j(x), \]
where \(\lambda = (\lambda_1,\dots,\lambda_p)\) are the Lagrange multipliers.
Under differentiability and regularity assumptions, a point \(x^*\) is optimal only if:
- Primal feasibility: \(h_j(x^*) = 0\) for all \(j\).
- Stationarity: \(\nabla f(x^*) + \sum_{j=1}^{p} \lambda_j^* \nabla h_j(x^*) = 0\).
Geometric meaning:
- The feasible set is typically a smooth manifold.
- At an optimum, the gradient of the objective must be orthogonal to all feasible directions.
- The multipliers \(\lambda_j^*\) weight the constraint normals to exactly cancel the objective’s gradient.
In other words, the objective tries to decrease, the constraints push back, and at the optimum these forces balance.
Inequality Constraints and the KKT Conditions¶
Now consider the general convex problem:
\[ \min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1,\dots,m, \qquad h_j(x) = 0, \ j = 1,\dots,p. \]
Form the Lagrangian
\[ \mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j} \lambda_j h_j(x) + \sum_{i} \mu_i g_i(x), \]
with:
- \(\lambda_j \in \mathbb{R}\) (equality multipliers),
- \(\mu_i \ge 0\) (inequality multipliers).
A point \(x^*\) with multipliers \((\lambda^*,\mu^*)\) satisfies the KKT conditions:
Primal feasibility¶
\[ g_i(x^*) \le 0, \qquad h_j(x^*) = 0. \]
Dual feasibility¶
\[ \mu_i^* \ge 0. \]
Stationarity¶
\[ \nabla f(x^*) + \sum_{j} \lambda_j^* \nabla h_j(x^*) + \sum_{i} \mu_i^* \nabla g_i(x^*) = 0. \]
Complementary slackness¶
\[ \mu_i^* \, g_i(x^*) = 0 \quad \text{for all } i. \]
Complementary slackness expresses a clear dichotomy:
- If constraint \(g_i(x) \le 0\) is inactive (strictly \(<0\)), then it applies no force: \(\mu_i^* = 0\).
- If a constraint is active at the boundary, it may exert a force: \(\mu_i^* > 0\), and then \(g_i(x^*) = 0\).
Only active constraints can push back against the objective.
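A tiny worked instance makes the dichotomy concrete. The sketch below (our own example) checks all four KKT conditions for \(\min\, (x-2)^2\) subject to \(x \le 1\), where the constraint is active:

```python
# minimize f(x) = (x - 2)^2  subject to  g(x) = x - 1 <= 0
# the unconstrained minimizer x = 2 is infeasible, so the constraint is active
x_star = 1.0
mu_star = 2.0   # from stationarity: 2*(x* - 2) + mu* = 0

g = x_star - 1.0
assert g <= 0.0                                   # primal feasibility
assert mu_star >= 0.0                             # dual feasibility
assert abs(2 * (x_star - 2) + mu_star) < 1e-12    # stationarity
assert abs(mu_star * g) < 1e-12                   # complementary slackness

# shadow-price check: relaxing the constraint to x <= 1 + t improves
# the optimal value by approximately mu* * t for small t
t = 1e-4
relaxed_opt = (min(2.0, 1.0 + t) - 2.0) ** 2
assert abs(((x_star - 2.0) ** 2 - relaxed_opt) - mu_star * t) < 1e-6
```

The last check previews the sensitivity interpretation of multipliers developed in the next chapter: \(\mu^*\) measures how much the optimum improves per unit of constraint relaxation.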
Slater’s Condition — Guaranteeing Strong Duality¶
The KKT conditions always provide necessary conditions for optimality. For them to also be sufficient (and to guarantee zero duality gap), the problem must satisfy a regularity condition.
For convex problems with convex \(g_i\) and affine \(h_j\), Slater’s condition holds if there exists a strictly feasible point:
\[ \exists\, \tilde{x}: \quad g_i(\tilde{x}) < 0 \ \text{for all } i, \qquad h_j(\tilde{x}) = 0 \ \text{for all } j. \]
Interpretation:
- The feasible region contains an interior point.
- The constraints are not “tight” everywhere.
- The geometry is rich enough for supporting hyperplanes to behave nicely.
When Slater’s condition holds:
- Strong duality holds: \(p^* = d^*\) (zero duality gap).
- The dual optimum is attained.
- The KKT conditions are both necessary and sufficient for optimality.
Duality gap¶
For a primal problem with optimum \(p^*\) and its dual with optimum \(d^*\), the duality gap is
\[ p^* - d^* \ \ge\ 0. \]
- A strictly positive gap indicates structural degeneracy or failure of constraint qualification.
- Slater’s condition ensures the gap is zero.
This link between geometry (interior feasibility) and algebra (zero gap) is fundamental.
Geometric and Physical Interpretation¶
The KKT conditions describe an equilibrium of forces:
- The objective gradient pushes the point in the direction of steepest decrease.
- Active constraints push back through normal vectors scaled by multipliers.
- At optimality, these forces exactly cancel.
Physically:
- Lagrange multipliers are “reaction forces’’ keeping a system on the constraint surface.
- In economics, they are “shadow prices’’ indicating how much the objective would improve if a constraint were relaxed.
- Geometrically, the stationarity condition means the objective and the active constraints share a supporting hyperplane at the optimum.
Chapter 9: Lagrange Duality Theory¶
Duality is one of the central organizing principles in convex optimization. Every constrained problem (the primal) has an associated dual problem, whose structure often provides:
- lower bounds on the primal optimal value,
- certificates of optimality,
- interpretations of constraint “prices,”
- and alternative algorithmic routes to solutions.
In convex optimization, duality is especially powerful: under mild conditions, the primal and dual attain the same optimal value. This equality — strong duality — lies behind the theory of KKT conditions, interior-point methods, and many ML algorithms such as SVMs.
The Primal Problem¶
Consider the general convex problem
\[ \min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1,\dots,m, \qquad h_j(x) = 0, \ j = 1,\dots,p, \]
where:
- \(f\) and each \(g_i\) are convex,
- each equality constraint \(h_j\) is affine.
The optimal value is
\[ f^\star = \inf \{\, f(x) : g_i(x) \le 0 \ \text{for all } i, \ h_j(x) = 0 \ \text{for all } j \,\}. \]
The infimum allows for the possibility that the best value is approached but not attained.
Why Duality?¶
A constrained problem can be viewed as:
minimize \(f(x)\) but pay a penalty whenever constraints are violated.
If the penalties are chosen “correctly,” one can recover the original constrained problem from an unconstrained penalized problem. Dual variables — \(\mu_i\) for inequalities and \(\lambda_j\) for equalities — precisely encode these penalties:
- \(\mu_i\) measures how costly it is to violate \(g_i(x)\le 0\),
- \(\lambda_j\) measures the sensitivity of the objective to relaxing \(h_j(x)=0\).
Duality converts constraints into prices, and transforms geometry into algebra.
The Lagrangian¶
The Lagrangian function is
\[ \mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j} \lambda_j h_j(x) + \sum_{i} \mu_i g_i(x), \]
with:
- \(\mu_i \ge 0\) for inequality constraints,
- \(\lambda_j \in \mathbb{R}\) unrestricted for equalities.
Interpretation:
- If \(\mu_i > 0\), violating \(g_i(x)\le 0\) incurs a penalty proportional to \(\mu_i\).
- If \(\mu_i = 0\), that constraint does not influence the Lagrangian at that point.
The Dual Function: Lower Bounds from Penalties¶
Fix \((\lambda,\mu)\) and minimize the Lagrangian with respect to \(x\):
\[ \theta(\lambda, \mu) = \inf_x \mathcal{L}(x, \lambda, \mu). \]
Because \(g_i(x) \le 0\) for feasible \(x\), \(\mu_i \ge 0\), and \(h_j(x) = 0\),
\[ \mathcal{L}(x, \lambda, \mu) \le f(x) \quad \text{for every feasible } x, \]
so taking the infimum over all \(x\) yields
\[ \theta(\lambda, \mu) \le f^\star. \]
Thus \(\theta\) always produces lower bounds on the true optimal value (weak duality).
Properties of the Dual Function¶
- \(\theta(\lambda,\mu)\) is always concave in \((\lambda,\mu)\) (infimum of affine functions).
- It may be \(-\infty\) if the Lagrangian is unbounded below.
The Dual Problem¶
The dual problem maximizes these lower bounds:
\[ \max_{\lambda, \ \mu \ge 0} \ \theta(\lambda, \mu). \]
Let \(d^\star\) be the optimal dual value.
Weak duality guarantees:
\[ d^\star \le f^\star. \]
The dual problem is always a concave maximization, i.e., a convex optimization problem in \((\lambda,\mu)\).
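To see weak and strong duality numerically, consider the minimum-norm problem \(\min \|x\|_2^2\) s.t. \(Ax = b\), where the Lagrangian minimization has a closed form. A sketch (our own construction, using NumPy):

```python
import numpy as np

# primal: minimize ||x||^2 subject to Ax = b
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 6))
b = rng.normal(size=3)

# primal optimum (closed form): x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

def dual(lam):
    # theta(lam) = inf_x { x^T x + lam^T (A x - b) }, attained at x = -A^T lam / 2
    return -0.25 * lam @ (A @ A.T) @ lam - lam @ b

# weak duality: every dual value lower-bounds p*
for _ in range(1000):
    lam = rng.normal(size=3, scale=5.0)
    assert dual(lam) <= p_star + 1e-9

# strong duality: maximizing theta attains p* at lam* = -2 (A A^T)^{-1} b
lam_star = -2.0 * np.linalg.solve(A @ A.T, b)
assert abs(dual(lam_star) - p_star) < 1e-9
```

Every random multiplier certifies a lower bound; the best multiplier closes the gap entirely, exactly as Slater’s condition predicts for this problem (the equality constraints are affine and feasible).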
Strong Duality and the Duality Gap¶
If
\[ d^\star = f^\star, \]
we say strong duality holds. The duality gap is zero.
Slater’s Condition¶
If:
- \(g_i\) are convex,
- \(h_j\) are affine,
- and there exists a \(\tilde{x}\) such that
\[ g_i(\tilde{x}) < 0 \ \text{for all } i, \qquad h_j(\tilde{x}) = 0 \ \text{for all } j, \]
then:
- strong duality holds (\(f^\star = d^\star\)),
- dual maximizers exist,
- KKT conditions fully characterize primal–dual optimality.
Slater’s condition ensures the feasible region has interior — the constraints are not tight everywhere.
Duality and the KKT Conditions¶
When strong duality holds, the primal and dual meet at a point satisfying the KKT conditions:
Primal feasibility¶
\[ g_i(x^\star) \le 0, \qquad h_j(x^\star) = 0. \]
Dual feasibility¶
\[ \mu_i^\star \ge 0. \]
Stationarity¶
\[ \nabla f(x^\star) + \sum_{j} \lambda_j^\star \nabla h_j(x^\star) + \sum_{i} \mu_i^\star \nabla g_i(x^\star) = 0. \]
Complementary slackness¶
\[ \mu_i^\star \, g_i(x^\star) = 0 \quad \text{for all } i. \]
Together these conditions ensure:
\[ f(x^\star) = \theta(\lambda^\star, \mu^\star) = d^\star = f^\star. \]
Geometrically, the gradients of the active constraints form a supporting hyperplane that “touches’’ the objective exactly at the optimum.
Interpretation of Dual Variables¶
Dual variables have consistent interpretations across optimization, ML, and economics.
Shadow Prices / Constraint Forces¶
- \(\mu_i^\star\): the shadow price for relaxing \(g_i(x)\le 0\). Large \(\mu_i^\star\) means the constraint is tight and costly to relax.
- \(\lambda_j^\star\): sensitivity of the optimal value to perturbations of \(h_j(x)=0\).
ML Interpretations¶
- Support Vector Machines: dual variables select support vectors (only points with \(\mu_i^\star > 0\) matter).
- L1-Regularization / Lasso: can be viewed through a dual constraint on parameter magnitudes.
- Regularized learning problems: the dual expresses the balance between data fit and model complexity.
Duality often reveals structure that is hidden in the primal, providing clearer geometric insight and sometimes simpler optimization paths.
Chapter 10: Multi-Objective Convex Optimization¶
Up to now we have focused on problems with a single objective: minimize one convex function over a convex set. However, real-world learning, engineering, and decision-making tasks almost always involve competing criteria:
- loss vs. fairness,
- return vs. risk,
- energy use vs. performance.
Multi-objective optimization provides the mathematical framework for balancing such competing goals. In convex settings, these trade-offs have elegant geometric and analytic structure, captured by Pareto optimality and by scalarization techniques that convert multiple objectives into a single convex problem.
Classical Optimality (One Objective)¶
In standard convex optimization, we solve:
\[ \min_{x \in \mathcal{X}} f(x), \]
where \(f\) is convex and \(\mathcal{X}\) is convex.
In this setting, it is natural to speak of the minimizer — or set of minimizers — because the task is governed by a single quantitative measure.
However, when multiple objectives \((f_1,\dots,f_k)\) must be minimized simultaneously, a single “best” point usually does not exist. Improving one objective can worsen another. Multi-objective optimization replaces the idea of a unique minimizer with the idea of efficient trade-offs.
Multi-Objective Convex Optimization¶
A multi-objective optimization problem takes the form
\[ \min_{x \in \mathcal{X}} \ \big( f_1(x), \dots, f_k(x) \big), \]
where each \(f_i\) is convex.
This framework appears in many ML and statistical tasks:
| Domain | Objective 1 | Objective 2 | Trade-off |
|---|---|---|---|
| Regression | Fit error | Regularization | Accuracy vs. complexity |
| Fair ML | Loss | Fairness metric | Utility vs. fairness |
| Portfolio | Return | Risk | Profit vs. stability |
| Autoencoders | Reconstruction | KL divergence | Fidelity vs. disentanglement |
Because objectives typically conflict, one cannot minimize all simultaneously. The natural notion of optimality becomes Pareto efficiency.
Pareto Optimality¶
Strong Pareto Optimality¶
A point \(x^*\) is Pareto optimal if there is no other feasible \(x\) such that
\[ f_i(x) \le f_i(x^*) \quad \text{for all } i, \]
with strict inequality for at least one objective. Thus, no trade-off-free improvement is possible: to improve one metric, you must worsen another.
Weak Pareto Optimality¶
A point \(x^*\) is weakly Pareto optimal if no feasible point satisfies
\[ f_i(x) < f_i(x^*) \quad \text{for all } i. \]
Weak optimality rules out simultaneous strict improvement in all objectives.
Geometric View¶
For two objectives \((f_1, f_2)\), the feasible set in objective space is a region in \(\mathbb{R}^2\). Its lower-left boundary, the set of points not dominated by others, is the Pareto frontier.
- Points on the frontier are the best achievable trade-offs.
- Points above or inside the region are dominated and thus suboptimal.
The Pareto frontier explicitly exposes the structure of trade-offs in a problem.
Scalarization: Turning Many Objectives into One¶
Multi-objective problems rarely have a unique minimizer. Scalarization constructs a single-objective surrogate problem whose solutions lie on the Pareto frontier.
Weighted-Sum Scalarization¶
\[ \min_{x \in \mathcal{X}} \ \sum_{i=1}^{k} w_i f_i(x), \qquad w_i \ge 0. \]
- The weights encode relative importance.
- Varying \(w\) traces (part of) the Pareto frontier.
- When \(f_i\) and \(\mathcal{X}\) are convex, this method recovers the convex portion of the frontier.
ε-Constraint Method¶
\[ \min_{x \in \mathcal{X}} f_1(x) \quad \text{s.t.} \quad f_i(x) \le \varepsilon_i, \quad i = 2, \dots, k. \]
- Here the tolerances \(\varepsilon_i\) act as performance budgets.
- Each choice of \(\varepsilon\) yields a different Pareto-efficient point.
This formulation directly highlights the trade-off between one primary objective and several secondary constraints.
Duality Connection¶
Scalarization has a tight relationship with duality (Chapter 9):
- Weights \(w_i\) in a weighted sum act like dual variables.
- Regularization parameters (e.g., the \(\lambda\) in L2 or L1 regularization) correspond to dual multipliers.
- Moving along \(\lambda\) traces the Pareto frontier between data fit and model complexity.
This connection explains why tuning regularization is equivalent to choosing a point on a trade-off curve.
Examples and Applications¶
Example 1: Regularized Least Squares¶
Consider the bicriterion pair
\[ f_1(x) = \|Ax - b\|_2^2, \qquad f_2(x) = \|x\|_2^2. \]
Two scalarizations:
- Weighted: \(\min_x \ \|Ax - b\|_2^2 + \lambda \|x\|_2^2\)
- ε-constraint: \(\min_x \ \|Ax - b\|_2^2 \ \text{s.t.} \ \|x\|_2^2 \le \tau\)
\(\lambda\) and \(\tau\) trace the same Pareto curve — the classical bias–variance trade-off.
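The weighted form has a closed-form solution, which makes the Pareto curve easy to trace. A sketch (our own illustration; the data are synthetic) sweeps \(\lambda\) and checks that fit error and solution norm move in opposite directions along the frontier:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def ridge(lmbda):
    # closed-form minimizer of ||Ax - b||^2 + lmbda * ||x||^2
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lmbda * np.eye(n), A.T @ b)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
fit = [np.sum((A @ ridge(l) - b) ** 2) for l in lambdas]
norm = [np.sum(ridge(l) ** 2) for l in lambdas]

# as lambda grows, fit error increases while ||x||^2 decreases:
# each lambda picks a different point on the Pareto frontier
assert all(f1 <= f2 + 1e-12 for f1, f2 in zip(fit, fit[1:]))
assert all(n1 >= n2 - 1e-12 for n1, n2 in zip(norm, norm[1:]))
```

Plotting `fit` against `norm` would render the frontier directly; the monotonicity asserted here is the algebraic signature of moving along it.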
Example 2: Portfolio Optimization (Risk–Return)¶
Let \(w\) be portfolio weights, \(\mu\) expected returns, and \(\Sigma\) the covariance matrix. Objectives: maximize return \(\mu^\top w\) and minimize risk \(w^\top \Sigma w\).
Weighted scalarization:
\[ \min_w \ -\mu^\top w + \alpha \, w^\top \Sigma w. \]
Varying \(\alpha\) recovers the efficient frontier of Modern Portfolio Theory.
Example 3: Fairness–Accuracy in ML¶
The two objectives are a predictive loss \(L(x)\) and a fairness penalty \(D(x)\), where \(D\) is a fairness metric.
Scalarized form:
\[ \min_x \ L(x) + \lambda \, D(x). \]
Tuning \(\lambda\) walks across the fairness–accuracy Pareto frontier.
Example 4: Variational Autoencoders and β-VAE¶
The ELBO is:
\[ \mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big). \]
Objectives:
- Reconstruction fidelity,
- Latent simplicity.
β-VAE scalarization:
\[ \max \ \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \beta \, \mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big). \]
\(\beta\) controls the trade-off between reconstruction and disentanglement — a Pareto frontier in latent space.
Overall, multi-objective convex optimization extends the geometry and structure of convex analysis to settings with trade-offs and competing priorities. The Pareto frontier reveals the set of achievable compromises, while scalarization methods let us navigate this frontier using tools from single-objective convex optimization, duality, and regularization theory.
Chapter 11: Balancing Fit and Complexity¶
Most real-world learning and estimation problems must balance two competing goals:
- Fit the observed data well, and
- Control the complexity of the model to avoid overfitting, instability, or noise amplification.
Regularization formalizes this trade-off by adding a convex penalty term to the objective. This chapter develops the structure, interpretation, and algorithms behind regularized convex problems, and shows how regularization corresponds directly to Pareto-optimal trade-offs (Chapter 10) between data fidelity and model simplicity.
Motivation: Fit vs. Complexity¶
Suppose we wish to estimate parameters \(x\) from data via a loss function \(f(x)\). If the data are noisy or the model is high-dimensional, solutions minimizing \(f\) alone may be unstable or overly complex. We introduce a regularizer \(R(x)\), typically convex, to encourage desirable structure:
\[ \min_x \ f(x) + \lambda R(x). \]
- \(f(x)\): measures data misfit (e.g., squared loss, logistic loss).
- \(R(x)\): penalizes complexity (e.g., \(\ell_1\) norm for sparsity, \(\ell_2\) norm for smoothness).
- \(\lambda\): controls the trade-off.
- Small \(\lambda\): excellent data fit, potentially overfitting.
- Large \(\lambda\): simpler model, potentially underfitting.
This is a scalarized multi-objective optimization problem of \((f, R)\).
Bicriterion Optimization and the Pareto Frontier¶
Regularization corresponds to the bicriterion objective:
\[ \min_x \ \big( f(x), \, R(x) \big). \]
A point \(x^*\) is Pareto optimal if there is no feasible \(x\) such that
\[ f(x) \le f(x^*) \quad \text{and} \quad R(x) \le R(x^*), \]
with strict inequality in at least one component.
For convex \(f\) and \(R\):
- Every \(\lambda \ge 0\) yields a Pareto-optimal point,
- The mapping from \(\lambda\) to constraint level \(R(x^*)\) is monotone,
- The Pareto frontier is convex and can be traced continuously by varying \(\lambda\).
Thus, tuning \(\lambda\) moves the solution along the fit–complexity frontier.
Why Control the Size of the Solution?¶
Inverse problems such as \(Ax \approx b\) are often ill-posed or ill-conditioned:
- Small noise in \(b\) may cause large variability in the solution \(x\).
- If \(A\) is rank-deficient or nearly singular, infinitely many solutions exist.
Example: ridge regression
\[ \min_x \ \|Ax - b\|_2^2 + \lambda \|x\|_2^2. \]
The optimality condition is
\[ (A^\top A + \lambda I) \, x = A^\top b. \]
Benefits of L2 regularization:
- \(A^\top A + \lambda I\) becomes positive definite for any \(\lambda > 0\),
- the solution becomes unique and stable,
- small singular directions of \(A\) are suppressed.
Interpretation: Regularization trades variance for stability by damping directions in which the data provide little information.
Constrained vs. Penalized Formulations¶
Regularized problems can be expressed equivalently as constrained problems:
\[ \min_x f(x) \quad \text{s.t.} \quad R(x) \le t. \]
The Lagrangian is
\[ \mathcal{L}(x, \lambda) = f(x) + \lambda \, (R(x) - t), \qquad \lambda \ge 0. \]
The penalized form
\[ \min_x \ f(x) + \lambda R(x) \]
is the Lagrangian counterpart of the constrained form. Under convexity and Slater’s condition, the two forms yield the same set of optimal solutions. The corresponding KKT conditions are:
\[ 0 \in \partial f(x^*) + \lambda^* \partial R(x^*), \qquad R(x^*) \le t, \qquad \lambda^* \ge 0, \qquad \lambda^* \, (R(x^*) - t) = 0. \]
Here:
- If \(R(x^*) < t\), then \(\lambda^* = 0\).
- If \(\lambda^* > 0\), then \(R(x^*) = t\) (constraint active).
Thus \(\lambda\) is the Lagrange multiplier controlling the slope of the Pareto frontier.
Common Regularizers and Their Effects¶
(a) L2 Regularization (Ridge)¶
\[ R(x) = \|x\|_2^2 \]
- Smooth and strongly convex.
- Shrinks coefficients uniformly.
- Improves conditioning.
- MAP interpretation: Gaussian prior \(x \sim \mathcal{N}(0,\tau^2 I)\).
(b) L1 Regularization (Lasso)¶
\[ R(x) = \|x\|_1 \]
- Convex but not differentiable → promotes sparsity.
- The \(\ell_1\) ball has corners aligned with coordinate axes, encouraging zeros in \(x\).
- Proximal operator (soft-thresholding):
\[ \big( \mathrm{prox}_{\lambda \|\cdot\|_1}(z) \big)_i = \operatorname{sign}(z_i) \max(|z_i| - \lambda, \, 0). \]
- MAP interpretation: Laplace prior.
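The soft-thresholding prox fits in a few lines. A sketch (our own, using NumPy), with a numerical check that the output indeed minimizes the prox objective \(\tfrac12\|x-z\|_2^2 + \lambda\|x\|_1\):

```python
import numpy as np

def soft_threshold(z, lam):
    # proximal operator of lam * ||.||_1: shrink toward zero,
    # with exact zeros wherever |z_i| <= lam
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.2])
x = soft_threshold(z, 1.0)   # -> [2.0, 0.0, 0.2]

# sanity check: x minimizes 0.5 * ||v - z||^2 + lam * ||v||_1 over v
def prox_obj(v):
    return 0.5 * np.sum((v - z) ** 2) + 1.0 * np.abs(v).sum()

rng = np.random.default_rng(6)
for _ in range(1000):
    v = x + rng.normal(scale=0.5, size=3)
    assert prox_obj(x) <= prox_obj(v) + 1e-9
```

The exact zero in the middle coordinate is the mechanism behind Lasso sparsity: small inputs are thresholded to zero rather than merely shrunk.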
(c) Elastic Net¶
\[ R(x) = \alpha \|x\|_1 + (1 - \alpha) \|x\|_2^2, \qquad \alpha \in [0, 1] \]
- Combines sparsity with numerical stability.
- Useful with correlated features.
(d) Beyond L1/L2: Structured Regularizers¶
| Regularizer | Formula | Effect |
|---|---|---|
| Tikhonov | \(\|Lx\|_2^2\) | smoothness via operator \(L\) |
| Total Variation | \(\|\nabla x\|_1\) | piecewise-constant signals/images |
| Group Lasso | \(\sum_g \|x_g\|_2\) | structured sparsity across groups |
| Nuclear Norm | \(\|X\|_* = \sum_i \sigma_i\) | low-rank matrices |
Each regularizer defines a geometry for the solution — ellipsoids, diamonds, polytopes, or spectral shapes.
Choosing the Regularization Parameter \(\lambda\)¶
(a) Trade-Off Behavior¶
- \(\lambda \downarrow\): favors small training error, high variance.
- \(\lambda \uparrow\): favors simplicity, higher bias.
\(\lambda\) selects a point on the fit–complexity Pareto frontier.
(b) Cross-Validation¶
The most common practice:
- Split data into folds.
- Train on \(k-1\) folds, validate on the remaining fold.
- Choose \(\lambda\) minimizing average validation error.
Guidelines:
- Standardize features for L1/Elastic Net.
- Use time-aware CV for dependent data.
- Use the “one-standard-error rule” for simpler models.
(c) Other Selection Methods¶
- Information criteria (AIC, BIC) for sparsity.
- L-curve or discrepancy principle in inverse problems.
- Regularization paths: computing \(x^*(\lambda)\) for many \(\lambda\).
Algorithmic View¶
Most regularized problems have the form:
\[ \min_x \ f(x) + R(x), \]
where \(f\) is smooth convex and \(R\) is convex (possibly nonsmooth).
Common algorithms:
| Method | Idea | When Useful |
|---|---|---|
| Proximal Gradient (ISTA/FISTA) | Gradient step on \(f\), proximal step on \(R\) | L1, TV, nuclear norm |
| Coordinate Descent | Update coordinates cyclically | Lasso, Elastic Net |
| ADMM | Split problem to exploit structure | Large-scale or distributed settings |
Proximal operators allow efficient handling of nonsmooth penalties. FISTA achieves optimal \(O(1/k^2)\) rate for smooth+convex problems.
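Putting the pieces together, a minimal ISTA sketch for the Lasso (our own construction on synthetic data; step size \(1/L\) from the largest eigenvalue of \(A^\top A\)):

```python
import numpy as np

# ISTA (proximal gradient) for the Lasso: min 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(7)
A = rng.normal(size=(40, 15))
x_true = np.zeros(15); x_true[[1, 4, 9]] = [1.5, -2.0, 0.8]
b = A @ x_true + 0.01 * rng.normal(size=40)
lam = 0.5

L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the smooth part

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(15)
for _ in range(500):
    grad = A.T @ (A @ x - b)                     # gradient step on f
    x = soft_threshold(x - grad / L, lam / L)    # proximal step on R

# the recovered support should match the sparse ground truth
assert np.all(np.abs(x[[1, 4, 9]]) > 0.3)
assert np.sum(np.abs(x) > 0.3) == 3
```

Swapping the plain gradient step for the momentum-extrapolated one would turn this into FISTA without changing the proximal step.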
Bayesian Interpretation¶
Regularization corresponds to MAP (maximum a posteriori) inference.
Linear model:
\[ b = Ax + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I). \]
With prior \(x \sim p(x)\), MAP estimation solves:
\[ \max_x \ \log p(b \mid x) + \log p(x). \]
Examples:
- Gaussian prior \(p(x) \propto e^{-\|x\|_2^2 / (2\tau^2)}\) → L2 penalty with \(\lambda = \sigma^2/(2\tau^2)\).
- Laplace prior \(p(x) \propto e^{-\|x\|_1 / \tau}\) → L1 penalty and a sparse MAP estimate.
Thus regularization is prior information: it encodes assumptions about structure, smoothness, or sparsity before observing data.
Regularization is therefore a unifying concept in optimization, statistics, and machine learning: it stabilizes ill-posed problems, enforces structure, and represents explicit choices on the Pareto frontier between data fit and complexity.
Chapter 12: Algorithms for Convex Optimization¶
In the previous chapters, we built the mathematical foundations of convex optimization: convex sets, convex functions, gradients, subgradients, KKT conditions, and duality. Now we answer the practical question: How do we actually solve convex optimization problems in practice?
This chapter now serves as the algorithmic backbone of the book. It bridges theoretical convex analysis (Chapters 3–11) with the practical numerical methods that solve those problems. Each algorithm here can be seen as a computational lens on a convex geometry concept — gradients as supporting planes, Hessians as curvature maps, and proximal maps as projection operators. Later chapters (13–15) extend these ideas to constrained, stochastic, and large-scale environments.
Problem classes vs method classes¶
Different convex problems call for different algorithmic structures. Here is the broad landscape:
| Problem Type | Typical Formulation | Representative Methods | Examples |
|---|---|---|---|
| Smooth, unconstrained | \(\min_x f(x)\), convex and differentiable | Gradient descent, Accelerated gradient, Newton | Logistic regression, least squares |
| Smooth with simple constraints | \(\min_x f(x)\) s.t. \(x \in \mathcal{X}\) (box, ball, simplex) | Projected gradient | Constrained regression, probability simplex |
| Composite convex (smooth + nonsmooth) | \(\min_x f(x) + R(x)\) | Proximal gradient, coordinate descent | Lasso, Elastic Net, TV minimization |
| General constrained convex | \(\min f(x)\) s.t. \(g_i(x) \le 0, h_j(x)=0\) | Interior-point, primal–dual methods | LP, QP, SDP, SOCP |
First-order methods: Gradient descent¶
We solve \(\min_x f(x)\), where \(f\) is convex, differentiable, and (ideally) \(L\)-smooth: its gradient is Lipschitz with constant \(L\), meaning
\[ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y. \]
Smoothness lets us control step sizes.
Gradient descent iterates
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \]
where \(\alpha_k>0\) is the step size (also called learning rate in machine learning). Typical choices:
- constant \(\alpha_k = 1/L\) when \(L\) is known,
- backtracking line search when \(L\) is unknown,
- diminishing step sizes in some settings.
Derivation:
Around \(x_t\), we can approximate \(f\) using its Taylor expansion:
\[ f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle. \]We assume \(f\) behaves approximately like its tangent plane near \(x_t\). But if we were to minimize just this linear model, we would move infinitely far in the direction of steepest descent \(-\nabla f(x_t)\), which is neither realistic nor stable. This motivates adding a locality restriction: we trust the linear approximation near \(x_t\), not globally. To prevent taking arbitrarily large steps, we add a quadratic penalty for moving away from \(x_t\):
\[ f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2\eta} \|x - x_t\|^2, \]where \(\eta > 0\) is the learning rate or step size.
- The linear term pulls \(x\) in the steepest descent direction.
- The quadratic term acts like a trust region, discouraging large deviations from \(x_t\).
- \(\eta\) trades off aggressive progress vs stability:
- Small \(\eta\) → cautious updates.
- Large \(\eta\) → bold updates (risk of divergence).
We define the next iterate as the minimizer of the surrogate objective:
\[ x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big[ f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2\eta} \|x - x_t\|^2 \Big]. \]Ignoring the constant term \(f(x_t)\) and differentiating w.r.t. \(x\):
\[ \nabla f(x_t) + \frac{1}{\eta}(x - x_t) = 0 \]Solving:
\[ x_{t+1} = x_t - \eta \nabla f(x_t) \]
Convergence: For convex, \(L\)-smooth \(f\), gradient descent with the fixed step size \(\alpha_k = 1/L\) satisfies
\[ f(x_k) - f^\star \le \frac{L \|x_0 - x^\star\|_2^2}{2k}, \]
where \(f^\star\) is the global minimum. This \(O(1/k)\) sublinear rate is slow compared to second-order methods, but each step is extremely cheap: you only need \(\nabla f(x_k)\).
When to use gradient descent:
- High-dimensional smooth convex problems (e.g. large-scale logistic regression).
- You can compute gradients cheaply.
- You only need moderate accuracy.
- Memory constraints rule out storing or factoring Hessians.
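As a concrete illustration, here is a minimal gradient-descent sketch on a synthetic least-squares problem \(f(x) = \tfrac12\|Ax-b\|_2^2\) (all data made up; the constant step \(\eta = 1/L\) follows the rule above, with \(L = \lambda_{\max}(A^\top A)\)):

```python
import numpy as np

# Gradient descent on f(x) = 0.5*||Ax - b||^2 with the constant step
# eta = 1/L, where L = lambda_max(A^T A). All data are synthetic.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

L = np.linalg.eigvalsh(A.T @ A).max()    # Lipschitz constant of the gradient
eta = 1.0 / L
x = np.zeros(5)
for _ in range(2000):
    x = x - eta * (A.T @ (A @ x - b))    # x_{t+1} = x_t - eta * grad f(x_t)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))        # near zero
```

Since this quadratic is strongly convex, the iterates converge to the least-squares solution; for merely convex \(f\) only the \(O(1/k)\) objective-gap rate is guaranteed.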
Accelerated first-order methods¶
Plain gradient descent has an \(O(1/k)\) rate for smooth convex problems. Remarkably, we can do better — and in fact, provably optimal — by adding momentum.
Nesterov acceleration¶
Nesterov’s accelerated gradient method modifies the update using a momentum-like extrapolation. One common form maintains two sequences \(x_k\) and \(y_k\):
- Take a gradient step from \(y_k\): \(x_{k+1} = y_k - \alpha_k \nabla f(y_k)\).
- Extrapolate: \(y_{k+1} = x_{k+1} + \beta_k (x_{k+1} - x_k)\).
The extra momentum term \(\beta_k (x_{k+1}-x_k)\) uses past iterates to “look ahead” and can significantly accelerate convergence.
Convergence: For smooth convex \(f\), accelerated gradient achieves
\[ f(x_k) - f^\star \le \frac{2L\|x_0 - x^\star\|_2^2}{(k+1)^2}, \]
which is optimal for any algorithm that uses only gradient information and not higher derivatives.
- Acceleration is effective for well-behaved smooth convex problems.
- It can be more sensitive to step size and noise than plain gradient descent.
- Variants such as FISTA apply acceleration in the composite setting \(f + R\).
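A minimal sketch of the two-sequence scheme on a synthetic smooth problem, using the common momentum schedule \(\beta_k = (k-1)/(k+2)\) (the data and schedule are illustrative choices):

```python
import numpy as np

# Nesterov's accelerated gradient on f(x) = 0.5*||Ax - b||^2 (synthetic
# data), with momentum beta_k = (k - 1)/(k + 2) and step size 1/L.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

L = np.linalg.eigvalsh(A.T @ A).max()
x = np.zeros(10)
y = x.copy()
for k in range(1, 5000):
    x_next = y - (1.0 / L) * (A.T @ (A @ y - b))    # gradient step from y_k
    y = x_next + (k - 1) / (k + 2) * (x_next - x)   # extrapolation step
    x = x_next

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))                             # small optimality gap
```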
The convergence of gradient descent depends strongly on the geometry of the level sets of the objective function. When these level sets are poorly conditioned, that is, highly anisotropic or elongated rather than spherical, the gradient directions tend to oscillate across narrow valleys, leading to zig-zag behavior and slow convergence. In contrast, when the level sets are well-conditioned (approximately spherical), gradient descent progresses efficiently toward the minimum. Thus, the efficiency of gradient-based methods is governed by how aspherical (anisotropic) the level sets are, which is directly related to the condition number of the Hessian.
Subgradient Methods¶
Even when \(f\) is not differentiable, we can minimise it using subgradient descent:
\[ x_{k+1} = x_k - \alpha_k g_k, \qquad g_k \in \partial f(x_k). \]
Key features:
- Requires only a subgradient (no differentiability needed).
- Works for any convex function.
- Step sizes must typically diminish, e.g. \(\alpha_k = c/\sqrt{k}\), or satisfy \(\sum_k \alpha_k = \infty\), \(\sum_k \alpha_k^2 < \infty\).
- Guaranteed convergence for convex \(f\), but generally slow.
Convergence rates (worst case)¶
- Smooth convex gradient descent: \(O(1/k)\) or \(O(1/k^2)\).
- Nonsmooth subgradient descent: \(O(1/\sqrt{k})\).
This slower rate reflects the lack of curvature information at kinks.
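A minimal sketch of subgradient descent on the nonsmooth convex function \(f(x) = \|x - c\|_1\) (the target \(c\) and the \(1/\sqrt{k}\) step schedule are illustrative choices):

```python
import numpy as np

# Subgradient descent on the nonsmooth convex f(x) = ||x - c||_1.
# sign(x - c) is a valid subgradient everywhere (0 at the kinks).
c = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
best = np.inf                          # track the best objective value seen
for k in range(1, 5001):
    g = np.sign(x - c)                 # a subgradient of ||x - c||_1
    x = x - (1.0 / np.sqrt(k)) * g
    best = min(best, np.abs(x - c).sum())
print(best)                            # slowly approaches 0
```

Tracking the best iterate matters: the individual iterates oscillate around the kinks rather than settling, which is exactly the slow \(O(1/\sqrt{k})\) behavior described above.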
Proximal and Smoothed Alternatives¶
Subgradient descent can be slow. Two important families of methods overcome this:
(1) Proximal methods¶
For a convex function \(f\), the proximal operator is
\[ \operatorname{prox}_{f}(y) = \arg\min_x \left( f(x) + \tfrac{1}{2}\|x - y\|_2^2 \right). \]
Proximal algorithms (e.g., ISTA, FISTA, ADMM) can handle nonsmooth terms like:
- \(\ell_1\) regularisation,
- indicator functions of convex sets,
- total variation penalties.
They achieve faster and more stable convergence than basic subgradient descent.
(2) Smoothing techniques¶
Many nonsmooth convex functions have smooth approximations:
- Replace the absolute value \(|x|\) with the Huber loss.
- Replace \(\max\{0, x\}\) with softplus \(\log(1 + e^{x})\).
- Replace \(\max_i x_i\) with log-sum-exp, a smooth convex approximation.
Smoothing preserves convexity while allowing the use of fast gradient methods.
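As a concrete instance of the last substitution, log-sum-exp can stand in for the max function; for smoothing parameter \(\mu > 0\) it satisfies \(\max_i x_i \le f_\mu(x) \le \max_i x_i + \mu \log n\), so the approximation tightens as \(\mu \to 0\):

```python
import numpy as np

# Log-sum-exp as a smooth convex surrogate for max:
#   f_mu(x) = mu * log( sum_i exp(x_i / mu) ),
# with max(x) <= f_mu(x) <= max(x) + mu*log(n).
def smooth_max(x, mu):
    z = x / mu
    m = z.max()                            # shift for numerical stability
    return mu * (m + np.log(np.sum(np.exp(z - m))))

x = np.array([1.0, 2.0, 3.0])
for mu in (1.0, 0.1, 0.01):
    print(mu, smooth_max(x, mu))           # approaches max(x) = 3 as mu -> 0
```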
Steepest Descent Method¶
The steepest descent method generalizes gradient descent by depending on the choice of norm used to measure step size or direction. It finds the direction of maximum decrease of the objective function under a unit norm constraint.
The norm defines the “geometry” of optimization. Gradient descent is steepest descent under the Euclidean norm. Changing the norm changes what “steepest” means, and can greatly affect convergence, especially for ill-conditioned or anisotropic problems. The norm in steepest descent determines the geometry of the descent and choosing an appropriate norm effectively makes the level sets of the function more rounded (more isotropic), which greatly improves convergence.
At a point \(x\), and for a chosen norm \(\|\cdot\|\):
\[ \Delta x_{\text{nsd}} = \arg\min_v \left\{ \nabla f(x)^\top v \;:\; \|v\| = 1 \right\}. \]
This defines the normalized steepest descent direction — the unit-norm direction that yields the most negative directional derivative (i.e., the steepest local decrease of \(f\)).
- \(\Delta x_{\text{nsd}}\): normalized steepest descent direction
- \(\Delta x_{\text{sd}} = \|\nabla f(x)\|_* \, \Delta x_{\text{nsd}}\): unnormalized direction (scaled by the dual norm of the gradient)
For small steps \(v\), \(f(x + v) \approx f(x) + \nabla f(x)^\top v\). The term \(\nabla f(x)^\top v\) describes how fast \(f\) increases in direction \(v\). To decrease \(f\) most rapidly, we pick \(v\) that minimizes this inner product — subject to \(\|v\| = 1\).
- The result depends on which norm we use to measure the “size” of \(v\).
- The corresponding dual norm \(\|\cdot\|_*\) determines how we measure the gradient’s magnitude.
Thus, the steepest descent direction always has a negative inner product with the gradient (it is a descent direction), but it is scaled and shaped according to the geometry induced by the chosen norm.
The choice of norm determines:
- The shape of the unit ball \(\{v : \|v\| \le 1\}\),
- The direction of steepest descent, since the minimization is constrained by that shape,
- The dual norm \(\|\nabla f(x)\|_*\) that measures the gradient’s size.
Different norms yield different “geometries” of descent:
| Norm | Unit Ball Shape | Dual Norm | Effect on Direction |
|---|---|---|---|
| \(\ell_2\) | Circle / sphere | \(\ell_2\) | Direction is opposite to gradient |
| \(\ell_1\) | Diamond | \(\ell_\infty\) | Moves along coordinate of largest gradient |
| \(\ell_\infty\) | Square | \(\ell_1\) | Moves all coordinates at once, each opposite the sign of its gradient component |
| Quadratic \((x^T P x)^{1/2}\) | Ellipsoid | Weighted \(\ell_2\) | Scales direction by preconditioner \(P^{-1}\) |
Thus, the norm defines how “distance” and “steepness” are perceived, shaping how the algorithm moves through the landscape of \(f(x)\).
Conjugate Gradient Method — Fast Optimization for Quadratic Objectives¶
Gradient descent can be painfully slow when the level sets of the objective are long and skinny, an indication that the Hessian has very different curvature in different directions (poor conditioning). The Conjugate Gradient (CG) method fixes this without forming or inverting the Hessian. It exploits the exact structure of quadratic functions to build search directions that incorporate curvature information at almost no extra cost.
CG is a first-order method that behaves like a second-order method for quadratics.
For a quadratic objective function:
\[ f(x) = \tfrac{1}{2} x^\top A x - b^\top x, \]
with \(A \succ 0\), the level sets are ellipses shaped by the eigenvalues of \(A\). If \(A\) is ill-conditioned, these ellipses are highly elongated. Gradient descent follows the steepest Euclidean descent direction, which points perpendicular to level sets. On elongated ellipses, this produces a zig-zag path that wastes many iterations.
CG replaces the steepest-descent directions with conjugate directions. Two nonzero vectors \(p_i, p_j\) are said to be A-conjugate if
\[ p_i^\top A\, p_j = 0, \qquad i \neq j. \]
This is orthogonality measured in the geometry induced by the Hessian \(A\). Why is this useful?
- Moving along an A-conjugate direction eliminates error components associated with a different eigen-direction of \(A\).
- Once you minimize along a conjugate direction, you never need to correct that direction again.
- After \(n\) mutually A-conjugate directions, all curvature directions are resolved → exact solution.
In contrast, gradient descent repeatedly re-corrects previous progress.
Algorithm (Linear CG): We solve the quadratic minimization problem \(\min_x \tfrac{1}{2} x^\top A x - b^\top x\) or, equivalently, the linear system \(Ax = b\). Let
\[ r_0 = b - A x_0, \qquad p_0 = r_0. \]
For \(k = 0,1,2,\dots\):
- Step size: \(\alpha_k = \dfrac{r_k^\top r_k}{p_k^\top A p_k}\)
- Update iterate: \(x_{k+1} = x_k + \alpha_k p_k\)
- Update residual (negative gradient): \(r_{k+1} = r_k - \alpha_k A p_k\)
- Direction scaling: \(\beta_k = \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}\)
- New conjugate direction: \(p_{k+1} = r_{k+1} + \beta_k p_k\)
Stop when \(\|r_k\|\) is below tolerance.
Every new direction \(p_{k+1}\) is constructed to be A-conjugate to all previous ones, and this is preserved automatically by the recurrence.
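The recurrences above translate directly into code; here is a minimal numpy sketch on a synthetic SPD system (matrix and right-hand side are made up for illustration):

```python
import numpy as np

# Linear conjugate gradient for Ax = b, A symmetric positive definite.
# Only matrix-vector products A @ p are needed, never a factorization.
def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                     # residual = negative gradient
    p = r.copy()                      # first search direction
    rr = r @ r
    for _ in range(max_iter or b.size):
        Ap = A @ p
        alpha = rr / (p @ Ap)         # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p     # next A-conjugate direction
        rr = rr_new
    return x

# Synthetic SPD test system: M^T M + I is symmetric positive definite.
rng = np.random.default_rng(2)
M = rng.standard_normal((50, 50))
A = M.T @ M + np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b, max_iter=200)
print(np.linalg.norm(A @ x - b))      # small residual
```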
Why CG Is Fast: For an \(n\)-dimensional quadratic, CG solves the problem in at most \(n\) iterations in exact arithmetic (floating-point round-off weakens this finite-termination guarantee rather than strengthening it). In practice, CG typically reaches acceptable accuracy much earlier, in roughly \(O(\sqrt{\kappa})\) iterations, where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the condition number. The convergence bound in the A-norm is:
\[ \|x_k - x^\star\|_A \;\le\; 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x^\star\|_A. \]
This is dramatically better than the \(O(1/k)\) rate of gradient descent.
CG is ideal when:
- The problem is a quadratic or a linear system with symmetric positive definite (SPD) matrix \(A\).
- \(A\) is large and sparse or available as a matrix–vector product.
- You cannot form or store \(A^{-1}\) or even the full matrix \(A\).
- You want a Hessian-aware method but cannot afford Newton’s method.
Typical scenarios:
| Application | Why CG fits |
|---|---|
| Large linear systems \(A x = b\) | Only requires \(A p\), not factorization. |
| Ridge regression | Normal equations form an SPD matrix. |
| Kernel ridge regression | Solves \((K+\lambda I)\alpha = y\) efficiently. |
| Newton steps in ML | Inner solver for Hessian systems without forming Hessian. |
| PDEs and scientific computing | Sparse SPD matrices, ideal for CG. |
Assumptions Required for CG: To guarantee correctness of linear CG, we require:
- \(A\) is symmetric
- \(A\) is positive definite
- Objective is strictly convex quadratic
- Arithmetic is exact (for the finite-step guarantee)
If the function is not quadratic or Hessian is not SPD, use Nonlinear CG, which generalizes the idea but loses finite-step guarantees.
Practical Notes:
- You only need matrix–vector products \(Ap\).
- Storage cost is \(O(n)\).
- Preconditioning (replacing the system with \(M^{-1} A\)) improves conditioning and accelerates convergence dramatically.
- Periodic re-orthogonalization can help in long runs with floating-point drift.
CG is the optimal descent method for quadratic objectives: it constructs Hessian-aware conjugate directions that efficiently resolve curvature, giving Newton-like speed while requiring only gradient-level operations.
Newton’s method and second-order methods¶
First-order methods (like gradient descent) only use gradient information. Newton’s method, in contrast, incorporates curvature information from the Hessian to take steps that better adapt to the local geometry of the function. This often leads to much faster convergence near the optimum.
From Chapter 3, the second-order Taylor approximation of \(f\) around a point \(x_k\) is:
\[ f(x_k + d) \approx f(x_k) + \nabla f(x_k)^\top d + \tfrac{1}{2}\, d^\top \nabla^2 f(x_k)\, d. \]
If we temporarily trust this quadratic model, we can choose \(d\) to minimize the right-hand side. Differentiating with respect to \(d\) and setting to zero gives:
\[ \nabla^2 f(x_k)\, d = -\nabla f(x_k). \]
Hence, the Newton step is:
\[ \Delta x_{\text{nt}} = -\nabla^2 f(x_k)^{-1} \nabla f(x_k). \]
This step aims directly at the stationary point of the local quadratic model. When the iterates are sufficiently close to the true minimizer of a strictly convex \(f\), Newton’s method achieves quadratic convergence—dramatically faster than the \(O(1/k)\) or \(O(1/k^2)\) rates typical of first-order algorithms.
However, far from the minimizer the quadratic model may be inaccurate, the Hessian may be indefinite, or the step may be unreasonably large. For stability, Newton’s method is almost always paired with a line search or trust-region strategy that adjusts step length based on how well the model predicts actual decrease.
Solving the Newton System¶
Each iteration requires solving
\[ H\, \Delta x_{\text{nt}} = -g, \qquad H = \nabla^2 f(x_k), \quad g = \nabla f(x_k). \]
If \(H\) is symmetric positive definite, a Cholesky factorization
\[ H = L L^\top \]
allows efficient and numerically stable solution via two triangular solves:
- \(L y = -g\)
- \(L^\top \Delta x_{\text{nt}} = y\)
This avoids forming \(H^{-1}\) explicitly.
The Newton decrement:
\[ \lambda(x) = \left( \nabla f(x)^\top \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2} = \left( \Delta x_{\text{nt}}^\top \nabla^2 f(x)\, \Delta x_{\text{nt}} \right)^{1/2} \]
gauges proximity to the optimum and provides a natural stopping criterion: \(\lambda(x)^2/2 < \varepsilon\).
Computationally, the dominant cost is solving the Newton system. For dense, unstructured problems this costs \(\approx (1/3)n^3\) operations, though sparsity or structure can reduce this dramatically. Because of this cost, Newton’s method is most appealing for problems of moderate dimension or for situations where Hessian systems can be solved efficiently using sparse linear algebra or matrix–free iterative methods.
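A minimal sketch of Newton's method with the Cholesky-based solve and the Newton-decrement stopping rule, on an illustrative strictly convex test function \(f(x) = \sum_i e^{x_i} + \tfrac{1}{2}\|x\|_2^2\) (well-behaved enough that full Newton steps suffice, with no line search):

```python
import numpy as np

# Newton's method with a Cholesky solve of H d = -g and the stopping
# criterion lambda(x)^2 / 2 < tol, on f(x) = sum(exp(x)) + 0.5*||x||^2.
def grad(x):
    return np.exp(x) + x

def hess(x):
    return np.diag(np.exp(x)) + np.eye(x.size)

x = np.full(4, 2.0)
for _ in range(50):
    g, H = grad(x), hess(x)
    Lc = np.linalg.cholesky(H)        # H = Lc Lc^T
    yv = np.linalg.solve(Lc, -g)      # forward solve:  Lc y = -g
    d = np.linalg.solve(Lc.T, yv)     # backward solve: Lc^T d = y
    if -g @ d / 2.0 < 1e-12:          # lambda(x)^2 / 2 below tolerance
        break
    x = x + d
print(x)                              # each coordinate near -0.5671 (e^x + x = 0)
```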
Gauss–Newton Method¶
The Gauss–Newton method is a specialization of Newton’s method for nonlinear least squares problems
\[ \min_x \; f(x) = \tfrac{1}{2}\|r(x)\|_2^2, \]
where \(r(x)\) is a vector of residual functions, nonlinear in \(x\), and \(J(x)\) is its Jacobian. Newton’s Hessian decomposes as
\[ \nabla^2 f(x) = J(x)^\top J(x) + \sum_i r_i(x)\, \nabla^2 r_i(x). \]
The second term involves the curvature of the residuals. When \(r(x)\) is approximately linear near the optimum, this term is small. Gauss–Newton drops it, giving the approximation
\[ \nabla^2 f(x) \approx J(x)^\top J(x), \]
leading to the Gauss–Newton step:
\[ (J^\top J)\, \Delta x = -J^\top r. \]
Thus each iteration reduces to solving a (potentially large but structured) least-squares system, avoiding full Hessians entirely. The Levenberg–Marquardt method adds a damping term,
\[ (J^\top J + \lambda I)\, \Delta x = -J^\top r, \]
which interpolates smoothly between
- gradient descent (large \(\lambda\)), and
- Gauss–Newton (small \(\lambda\)).
Damping improves robustness when the Jacobian is rank-deficient or when the neglected second-order terms are not negligible. Gauss–Newton and Levenberg–Marquardt are highly effective when the residuals are nearly linear—common in curve fitting, bundle adjustment, and certain layerwise training procedures in deep learning—yielding fast convergence without the expense of full second derivatives.
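A small Levenberg–Marquardt sketch on a synthetic curve-fitting problem (the model, data, initial guess, and damping schedule are all illustrative choices):

```python
import numpy as np

# Levenberg-Marquardt for a toy fit: y ≈ a * exp(b*t), minimizing
# 0.5*||r(theta)||^2 with r_i = a*exp(b*t_i) - y_i.
# Data are synthetic and noiseless (true a = 2, b = -1).
t = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-1.0 * t)

def residual(th):
    return th[0] * np.exp(th[1] * t) - y

def jacobian(th):
    e = np.exp(th[1] * t)
    return np.column_stack([e, th[0] * t * e])    # dr/da, dr/db

th = np.array([1.5, -0.5])        # rough initial guess
lam = 1e-2                        # damping parameter
for _ in range(100):
    r, J = residual(th), jacobian(th)
    # (J^T J + lam*I) step = -J^T r  -- pure Gauss-Newton as lam -> 0
    step = np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)
    if 0.5 * np.sum(residual(th + step) ** 2) < 0.5 * np.sum(r ** 2):
        th, lam = th + step, lam * 0.5            # accept step, relax damping
    else:
        lam *= 10.0                               # reject step, damp harder
print(th)                                         # approaches (2, -1)
```

The accept/reject rule is the interpolation described above in action: rejected steps push \(\lambda\) up (gradient-descent-like caution), accepted steps pull it down (Gauss–Newton speed).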
Quasi-Newton methods¶
When computing or storing the Hessian is too expensive, we can build low-rank approximations of \(\nabla^2 f(x_k)\) or its inverse. These methods use gradient information from previous steps to estimate curvature.
The most famous examples are:
- BFGS (Broyden–Fletcher–Goldfarb–Shanno)
- DFP (Davidon–Fletcher–Powell)
- L-BFGS (Limited-memory BFGS) — for very large-scale problems.
Quasi-Newton methods such as BFGS and L-BFGS build inverse-Hessian approximations from gradient differences, retaining much of Newton’s fast local convergence while keeping per-iteration costs similar to first-order methods. For instance, BFGS maintains an approximation \(B_k \approx \nabla^2 f(x_k)^{-1}\) updated via gradient and step differences:
\[ B_{k+1} = \left(I - \rho_k s_k y_k^\top\right) B_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}, \]
where \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).
These methods achieve superlinear convergence in practice, making them popular for large smooth optimization problems.
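A minimal BFGS sketch using the rank-two inverse-Hessian update above with a backtracking (Armijo) line search, on a synthetic strongly convex quadratic:

```python
import numpy as np

# BFGS: maintain B ≈ (grad^2 f)^{-1} via the rank-two update, with a
# backtracking (Armijo) line search. Test problem: a synthetic strongly
# convex quadratic f(x) = 0.5 x^T Q x - c^T x.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
Q = M.T @ M + np.eye(6)
c = rng.standard_normal(6)

def f(x):
    return 0.5 * x @ Q @ x - c @ x

def grad(x):
    return Q @ x - c

x = np.zeros(6)
B = np.eye(6)                         # inverse-Hessian approximation
I = np.eye(6)
g = grad(x)
for _ in range(100):
    if np.linalg.norm(g) < 1e-10:
        break
    d = -B @ g                        # quasi-Newton search direction
    step = 1.0                        # backtracking (Armijo) line search
    while f(x + step * d) > f(x) + 1e-4 * step * (g @ d):
        step *= 0.5
    x_new = x + step * d
    g_new = grad(x_new)
    s, yv = x_new - x, g_new - g      # step and gradient differences
    rho = 1.0 / (yv @ s)              # positive for strongly convex f
    B = (I - rho * np.outer(s, yv)) @ B @ (I - rho * np.outer(yv, s)) \
        + rho * np.outer(s, s)
    x, g = x_new, g_new
print(np.linalg.norm(grad(x)))        # near zero
```

L-BFGS implements the same update implicitly from the last few \((s_k, y_k)\) pairs, avoiding the dense \(n \times n\) matrix \(B\).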
When to use Newton or quasi-Newton methods:
- You need high-accuracy solutions.
- The problem is smooth and reasonably well-conditioned.
- The dimension is moderate, or Hessian systems can be solved efficiently (e.g., using sparse linear algebra).
For large, ill-conditioned, or nonsmooth problems, first-order or proximal methods (Chapter 10) are typically more suitable.
Constraints and nonsmooth terms: projection and proximal methods¶
In practice, most convex objectives are not just “nice smooth \(f(x)\)”. They often have:
- constraints \(x \in \mathcal{X}\),
- nonsmooth regularisers like \(\|x\|_1\),
- penalties that encode robustness or sparsity (Chapter 6).
Two core ideas handle this: projected gradient and proximal gradient.
Projected gradient descent¶
Setting: Minimise convex, differentiable \(f(x)\) subject to \(x \in \mathcal{X}\), where \(\mathcal{X}\) is a simple closed convex set (Chapter 4).
Algorithm:
- Gradient step: \(y_k = x_k - \alpha_k \nabla f(x_k)\)
- Projection: \(x_{k+1} = \Pi_{\mathcal{X}}(y_k)\)
Interpretation:
- You take an unconstrained step downhill,
- then you “snap back” to feasibility by Euclidean projection.
Examples of \(\mathcal{X}\) where projection is cheap:
- A box: \(l \le x \le u\) (clip each coordinate).
- The probability simplex \(\{x \ge 0, \sum_i x_i = 1\}\) (there are fast projection routines).
- An \(\ell_2\) ball \(\{x : \|x\|_2 \le R\}\) (scale down if needed).
Projected gradient is the constrained version of gradient descent. It maintains feasibility at every iterate.
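A minimal sketch for the box-constrained case with synthetic data; projection onto a box is just coordinate-wise clipping:

```python
import numpy as np

# Projected gradient descent for box-constrained least squares:
#   minimize 0.5*||Ax - b||^2   s.t.   0 <= x <= 1.
rng = np.random.default_rng(4)
A = rng.standard_normal((15, 4))
b = rng.standard_normal(15)

L = np.linalg.eigvalsh(A.T @ A).max()
x = np.zeros(4)
for _ in range(2000):
    y = x - (1.0 / L) * (A.T @ (A @ x - b))   # unconstrained gradient step
    x = np.clip(y, 0.0, 1.0)                  # snap back onto the box
print(x)                                      # every coordinate in [0, 1]
```

At convergence the iterate is a fixed point of the gradient-step-then-project map, which is exactly the optimality condition for this constrained problem.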
Proximal gradient (forward–backward splitting)¶
Setting: Composite convex minimisation
\[ \min_x \; f(x) + R(x), \]
where:
- \(f\) is convex, differentiable, with Lipschitz gradient,
- \(R\) is convex, possibly nonsmooth.
Typical choices of \(R(x)\):
- \(R(x) = \lambda \|x\|_1\) (sparsity),
- \(R(x) = \lambda \|x\|_2^2\) (ridge),
- \(R(x)\) is the indicator function of a convex set \(\mathcal{X}\), i.e. \(R(x)=0\) if \(x \in \mathcal{X}\) and \(+\infty\) otherwise — this encodes a hard constraint.
Define the proximal operator of \(R\), scaled by the step size \(\alpha > 0\):
\[ \operatorname{prox}_{\alpha R}(y) = \arg\min_x \left( R(x) + \tfrac{1}{2\alpha}\|x - y\|_2^2 \right). \]
Proximal gradient method:
- Gradient step on \(f\): \(y_k = x_k - \alpha \nabla f(x_k)\)
- Proximal step on \(R\): \(x_{k+1} = \operatorname{prox}_{\alpha R}(y_k)\)
This is also called forward–backward splitting: “forward” = gradient step, “backward” = prox step.
Interpretation:¶
- The prox step “handles” the nonsmooth or constrained part exactly.
- For \(R(x)=\lambda \|x\|_1\), \(\mathrm{prox}_{\alpha R}\) is soft-thresholding, \(\big(\mathrm{prox}_{\alpha R}(y)\big)_i = \operatorname{sign}(y_i)\max(|y_i| - \alpha\lambda,\, 0)\), which promotes sparsity in \(x\).
  This is the heart of \(\ell_1\)-regularised least-squares (LASSO) and many sparse recovery problems.
- For \(R\) the indicator function of \(\mathcal{X}\), \(\mathrm{prox}_{\alpha R} = \Pi_\mathcal{X}\), so projected gradient is a special case of proximal gradient.
This unifies constraints and regularisation.
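Putting the pieces together, here is a minimal ISTA (proximal gradient) sketch for the lasso on synthetic data, with the soft-thresholding prox:

```python
import numpy as np

# ISTA (proximal gradient) for the lasso:
#   minimize 0.5*||Ax - b||^2 + lam*||x||_1,
# with synthetic data generated from a sparse ground truth.
def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[[1, 6]] = [3.0, -2.0]                  # sparse ground truth
b = A @ x_true
lam = 0.5

alpha = 1.0 / np.linalg.eigvalsh(A.T @ A).max()   # step size 1/L
x = np.zeros(10)
for _ in range(3000):
    y = x - alpha * (A.T @ (A @ x - b))       # forward (gradient) step
    x = soft_threshold(y, alpha * lam)        # backward (prox) step
print(np.round(x, 3))                         # an approximately sparse solution
```

FISTA adds the Nesterov extrapolation of the previous section on top of exactly this iteration, improving the rate to \(O(1/k^2)\).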
When to use proximal / projected gradient¶
- High-dimensional ML/statistics problems.
- Objectives with \(\ell_1\), group sparsity, total variation, hinge loss, or indicator constraints.
- You can evaluate \(\nabla f\) and compute \(\mathrm{prox}_{\alpha R}\) cheaply.
- You don’t need absurdly high accuracy, but you do need scalability.
This is the standard tool for modern large-scale convex learning problems.
Penalties, barriers, and interior-point methods¶
So far we’ve assumed either:
- simple constraints we can project onto,
- or nonsmooth terms we can prox.
What if the constraints are general convex inequalities \(g_i(x)\le0\)? Enter penalty methods, barrier methods, and (ultimately) interior-point methods.
Penalty methods¶
Turn constrained optimisation into unconstrained optimisation by adding a penalty for violating constraints. Suppose we want
\[ \min_x \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad i = 1, \dots, m. \]
A penalty method solves instead
\[ \min_x \; F_\rho(x) = f(x) + \rho \sum_{i=1}^m \phi(g_i(x)), \]
where:
- \(\phi(r)\) is \(0\) when \(r \le 0\) (feasible),
- \(\phi(r)\) grows when \(r>0\) (infeasible),
- \(\rho > 0\) is a penalty weight.
As \(\rho \to \infty\), infeasible points become extremely expensive, so minimisers approach feasibility.
This is conceptually simple and is sometimes effective, but:
- choosing \(\rho\) is tricky,
- very large \(\rho\) can make the landscape ill-conditioned and hard for gradient/Newton to solve.
Algorithm: Basic Penalty Method (Quadratic or General Penalization)¶
Goal: Solve
\[ \min_x \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad i = 1, \dots, m. \]
Penalty formulation:
\[ F_\rho(x) = f(x) + \rho \sum_{i=1}^m \phi(g_i(x)), \]
where
- \(\phi(r) = 0\) if \(r \le 0\),
- \(\phi(r)\) grows when \(r>0\) (e.g., \(\phi(r)=\max\{0,r\}^2\)),
- \(\rho > 0\) is the penalty weight.
Inputs:
- objective \(f(x)\)
- constraints \(g_i(x)\)
- penalty function \(\phi\)
- initial point \(x_0\)
- initial penalty parameter \(\rho_0 > 0\)
- penalty update factor \(\gamma > 1\)
- tolerance \(\varepsilon\)
Procedure:
- Choose \(x_0\), \(\rho_0 > 0\).
- For \(k = 0, 1, 2, \dots\):
- Solve the penalized subproblem \(x_{k+1} = \arg\min_x F_{\rho_k}(x)\) using Newton’s method, gradient descent, quasi-Newton, etc.
- Check feasibility / stopping: If \(\max_i g_i(x_{k+1}) \le \varepsilon, \quad \|x_{k+1} - x_k\| \le \varepsilon\) stop and return \(x_{k+1}\).
- Increase penalty parameter \(\rho_{k+1} = \gamma\, \rho_k\) with typical \(\gamma \in [5,10]\).
- End.
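A compact instance of this procedure on a toy problem with a known closed-form answer. With \(\phi(r)=\max\{0,r\}^2\), each penalized subproblem is quadratic wherever the constraint is violated, so the sketch solves its stationarity condition directly instead of running an inner descent loop:

```python
import numpy as np

# Quadratic-penalty method for:  minimize ||x - c||^2  s.t.  a^T x <= 1,
# with c = (1, 1) and a = (1, 1); the exact solution is (0.5, 0.5).
# In the violated region a^T x > 1, stationarity of F_rho reads
#   2(x - c) + 2*rho*a*(a^T x - 1) = 0,
# a linear system we solve exactly before increasing rho.
c = np.array([1.0, 1.0])
a = np.array([1.0, 1.0])

rho = 1.0
x = c.copy()                     # unconstrained minimizer (infeasible here)
for _ in range(12):              # outer loop: tighten the penalty
    x = np.linalg.solve(2.0 * np.eye(2) + 2.0 * rho * np.outer(a, a),
                        2.0 * c + 2.0 * rho * a)
    rho *= 5.0                   # update factor gamma = 5
print(x, a @ x - 1.0)            # violation shrinks like O(1/rho)
```

The printout illustrates the trade-off discussed above: the iterate is always slightly infeasible, and driving the violation down requires an ever larger, worse-conditioned \(\rho\).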
Barrier methods¶
Penalty methods penalise violation after you cross the boundary. Barrier methods make it impossible to even touch the boundary. For inequality constraints \(g_i(x) \le 0\), define the logarithmic barrier
\[ b(x) = -\sum_{i=1}^m \log(-g_i(x)). \]
This is finite only if \(g_i(x) < 0\) for all \(i\), i.e. \(x\) is strictly feasible. As you approach the boundary \(g_i(x)=0\), \(b(x)\) blows up to \(+\infty\).
We then solve, for a sequence of increasing parameters \(t\):
\[ \min_x \; F_t(x) = t f(x) + b(x), \]
subject to strict feasibility \(g_i(x)<0\).
As \(t \to \infty\), minimisers of \(F_t\) approach the true constrained optimum. The path of minimisers \(x^*(t)\) is called the central path.
Key points:
- \(F_t\) is smooth on the interior of the feasible region.
- We can apply Newton’s method to \(F_t\).
- Each Newton step solves a linear system involving the Hessian of \(F_t\), so the inner loop looks like a damped Newton method.
- Increasing \(t\) tightens the approximation; we “home in” on the boundary of feasibility.
Algorithm: Barrier Method (Logarithmic Barrier / Interior Approximation)¶
Goal: Solve the constrained problem
\[ \min_x \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad i = 1, \dots, m. \]
Logarithmic barrier:
\[ b(x) = -\sum_{i=1}^m \log(-g_i(x)), \]
defined only for strictly feasible points \(g_i(x)<0\).
Barrier subproblem:
\[ F_t(x) = t f(x) + b(x), \]
where \(t>0\) is the barrier parameter.
As \(t \to \infty\), minimizers of \(F_t\) approach the constrained optimum.
Inputs:
- objective \(f(x)\)
- inequality constraints \(g_i(x)\)
- barrier function \(b(x)\)
- strictly feasible starting point \(x_0\) (\(g_i(x_0) < 0\))
- initial barrier parameter \(t_0 > 0\)
- barrier growth factor \(\mu > 1\) (often \(\mu = 10\))
- tolerance \(\varepsilon\)
Procedure:
- Choose strictly feasible \(x_0\), and pick \(t_0 > 0\).
- For \(k = 0,1,2,\dots\):
- Centering step (inner loop): Solve the barrier subproblem \(x_{k+1} = \arg\min_x F_{t_k}(x)\). Typically use Newton’s method (damped) on \(F_{t_k}\). Stop when the Newton decrement satisfies \(\lambda(x_{k+1})^2/2 \le \varepsilon\).
- Optimality / stopping test: If \(\frac{m}{t_k} \le \varepsilon,\) then \(x_{k+1}\) is an \(\varepsilon\)-approximate solution of the original constrained problem; stop and return \(x_{k+1}\).
- Increase barrier parameter: \(t_{k+1} = \mu\, t_k,\) which tightens the approximation and moves closer to the boundary.
- End.
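The same toy problem as in the penalty sketch, now solved by the barrier method with a Newton centering step and a backtracking rule that keeps every iterate strictly feasible (problem data are illustrative):

```python
import numpy as np

# Barrier method for:  minimize ||x - c||^2  s.t.  a^T x <= 1, with
# c = (1,1), a = (1,1); the optimum is (0.5, 0.5). Inner loop: Newton on
#   F_t(x) = t*||x - c||^2 - log(1 - a^T x).
c = np.array([1.0, 1.0])
a = np.array([1.0, 1.0])
x = np.zeros(2)                       # strictly feasible start (a^T x = 0 < 1)
t, mu, m = 1.0, 10.0, 1               # m = number of inequality constraints

while m / t > 1e-9:                   # suboptimality bound m/t as stopping test
    for _ in range(50):               # centering: Newton steps on F_t
        s = 1.0 - a @ x               # slack, must stay > 0
        g = 2.0 * t * (x - c) + a / s
        H = 2.0 * t * np.eye(2) + np.outer(a, a) / s**2
        d = np.linalg.solve(H, -g)
        if -g @ d / 2.0 < 1e-12:      # Newton decrement test
            break
        step = 1.0
        while 1.0 - a @ (x + step * d) <= 0.0:
            step *= 0.5               # backtrack to stay strictly feasible
        x = x + step * d
    t *= mu
print(x)                              # approaches (0.5, 0.5) from the interior
```

Note the contrast with the penalty sketch: here every iterate is strictly feasible, and the sequence of centering solutions traces the central path toward the boundary.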
Interior-point methods¶
Interior-point methods combine barrier functions with Newton’s method to solve general convex programs:
- They maintain strict feasibility throughout.
- Each iteration solves a Newton system for the barrier-augmented objective.
- They naturally generate primal–dual pairs and duality gap estimates.
- Under standard assumptions (e.g., Slater’s condition), they converge in a predictable number of iterations.
Interior-point methods are the foundation of modern solvers for LP, QP, SOCP, and SDP. They are more expensive per iteration than first-order methods but converge in far fewer steps and achieve high accuracy.
Algorithm: Primal–Dual Interior-Point Method (for convex inequality constraints)¶
We consider the problem
\[ \min_x \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad i = 1, \dots, m. \]
Introduce Lagrange multipliers \(\lambda \ge 0\). The KKT conditions are
\[ \nabla f(x) + \sum_{i=1}^m \lambda_i \nabla g_i(x) = 0, \qquad g_i(x) \le 0, \qquad \lambda_i \ge 0, \qquad \lambda_i g_i(x) = 0. \]
Interior-point methods enforce the relaxed complementarity condition
\[ -\lambda_i g_i(x) = \frac{1}{t}, \qquad i = 1, \dots, m, \]
which keeps iterates strictly feasible.
Inputs¶
- objective \(f(x)\)
- inequality constraints \(g_i(x)\)
- initial primal point \(x_0\) with \(g_i(x_0)<0\)
- initial dual variable \(\lambda_0 > 0\)
- initial barrier parameter \(t_0 > 0\)
- growth factor \(\mu > 1\)
- tolerance \(\varepsilon\)
Procedure¶
-
Choose strictly feasible \(x_0\), positive \(\lambda_0\), and \(t_0\).
-
For \(k = 0,1,2,\dots\):
(a) Form the perturbed KKT system. Solve for the Newton direction \((\Delta x, \Delta \lambda)\):
\[ \begin{bmatrix} \nabla^2 f(x) + \sum_i \lambda_i \nabla^2 g_i(x) & Dg(x)^\top \\ -\operatorname{diag}(\lambda)\, Dg(x) & -\operatorname{diag}(g(x)) \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + Dg(x)^\top \lambda \\ -\operatorname{diag}(\lambda)\, g(x) - (1/t)\mathbf{1} \end{bmatrix}, \]
where \(Dg(x)\) is the Jacobian of \(g = (g_1, \dots, g_m)\).
(b) Line search to keep strict feasibility. Choose the maximum \(\alpha\in(0,1]\) such that:
- \(g_i(x + \alpha \Delta x) < 0\),
- \(\lambda + \alpha \Delta \lambda > 0\).
(c) Update: \(x \leftarrow x + \alpha \Delta x, \qquad \lambda \leftarrow \lambda + \alpha \Delta \lambda.\)
(d) Check duality gap: \(\text{gap} = - g(x)^\top \lambda\) If \(\text{gap} \le \varepsilon\), stop.
(e) Increase barrier parameter \(t \leftarrow \mu t.\)
-
Return \(x\).
Choosing the right method in practice¶
Case A. Smooth, unconstrained, very high dimensional.
Example: logistic regression on millions of samples.
Use: gradient descent or (better) accelerated gradient.
Why: cheap iterations, easy to implement, scales.
Case B. Smooth, unconstrained, moderate dimensional, need high accuracy.
Example: convex nonlinear fitting with well-behaved Hessian.
Use: Newton or quasi-Newton.
Why: quadratic (or near-quadratic) convergence near optimum.
Case C. Convex with simple feasible set \(x \in \mathcal{X}\) (box, ball, simplex).
Use: projected gradient.
Why: projection is easy, maintains feasibility at each step.
Case D. Composite objective \(f(x) + R(x)\) where \(R\) is nonsmooth (e.g. \(\ell_1\), indicator of a constraint set).
Use: proximal gradient.
Why: prox handles nonsmooth/constraint part exactly each step.
Case E. General convex program with inequalities \(g_i(x)\le 0\).
Use: interior-point methods.
Why: they solve smooth barrier subproblems via Newton steps and give primal–dual certificates through KKT and duality (Chapters 7–8).
Mental Map¶
Algorithms for Convex Optimization
Turning convex geometry + optimality conditions into solvers
│
▼
Core question: how do we actually solve problems?
Choose an algorithm class that matches problem structure + scale
│
▼
┌────────────────────────────────────────────────────────────┐
│ Problem Classes → Method Classes │
│ Smooth unconstrained: GD, Acceleration, Newton │
│ Smooth + simple constraints: Projected gradient │
│ Composite (smooth+nonsmooth): Proximal/coordinate/splitting│
│ General constrained convex: Interior-point / primal–dual │
└────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ First-Order Core: Gradient Descent │
│ x_{k+1} = x_k − α_k ∇f(x_k) │
│ Requirements: convex + L-smooth │
│ - Step size from L or line search │
│ - Rate (smooth convex): O(1/k) │
│ Geometry: move opposite supporting hyperplane slope       │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Acceleration (Nesterov / Momentum) │
│ Two sequences: gradient at y_k + extrapolation to y_{k+1} │
│ - Rate (smooth convex): O(1/k^2) │
│ - Best possible for gradient-only methods │
│ Tradeoff: faster but more sensitive to tuning/noise │
└───────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Nonsmooth First-Order: Subgradient Descent │
│ x_{k+1} = x_k − α_k g_k, g_k ∈ ∂f(x_k) │
│ - Works for convex nonsmooth objectives │
│ - Needs diminishing step sizes │
│ - Worst-case rate: O(1/√k) │
│ Use when: only subgradients available / simple implementation│
└──────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Fixing Nonsmoothness: Proximal & Smoothing │
│ Prox operator: prox_{αR}(y)=argmin_x R(x)+(1/2α)‖x−y‖² │
│ - Handles ℓ₁, indicators, TV, etc. │
│ Smoothing: Huber / softplus / log-sum-exp │
│ - Enables fast smooth methods │
└───────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Steepest Descent = Gradient Descent under a chosen norm │
│ Δx_nsd = argmin_{‖v‖=1} ∇f(x)ᵀv │
│ - Dual norm controls gradient magnitude │
│ - ℓ₂ → standard GD │
│ - Quadratic norm → preconditioned GD / Newton-like │
│ - ℓ₁ → coordinate-like steps │
│ Purpose: change geometry to reduce anisotropy / improve rate│
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Quadratic Structure: Conjugate Gradient (CG) │
│ Solve: ½xᵀAx − bᵀx, A≻0 (equivalently Ax=b) │
│ - Builds A-conjugate directions (Hessian-orthogonal) │
│ - Uses only matrix–vector products Ap │
│ - Converges in ≤ n steps (exact arithmetic) │
│ - Practical iterations ~ O(√κ) with κ=λ_max/λ_min │
│ - Preconditioning is key for speed │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Second-Order Methods: Newton & Variants │
│ Newton step: ∇²f(x_k) d = −∇f(x_k) │
│ - Quadratic local model → fast local convergence │
│ - Needs line search / trust region for robustness │
│ Gauss–Newton / Levenberg–Marquardt: least-squares structure │
│ Quasi-Newton (BFGS/L-BFGS): Hessian inverse approximations │
│ Use when: moderate dimension or efficient linear solves │
└───────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Handling Constraints: Projection & Proximal Splitting │
│ Projected GD: y=x−α∇f(x); x⁺=Π_X(y) │
│ Proximal gradient: y=x−α∇f(x); x⁺=prox_{αR}(y) │
│ Unification: indicator_R(X) ⇒ prox = projection │
│ Use when: constraints/regularizers have cheap prox/project │
└──────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ General Inequalities: Penalties → Barriers → Interior-Point │
│ Penalty: f(x)+ρ Σ φ(g_i(x)) │
│ Barrier: tf(x) − Σ log(−g_i(x)) (strict feasibility) │
│ Interior-point: Newton steps on (primal–dual) perturbed KKT │
│ - Few iterations, high accuracy, solver backbone for LP/QP │
│ - Uses duality gap certificates │
└─────────────────────────────────────────────────────────────┘
│
▼
Practical selection (the decision logic)
┌─────────────────────────────────────────────────────────────┐
│ Huge-scale smooth → GD / accelerated / L-BFGS │
│ Moderate smooth, high accuracy → Newton / quasi-Newton │
│ Simple constraints → projected gradient │
│ Smooth + nonsmooth regularizer → proximal gradient / FISTA │
│ General constraints → interior-point / primal–dual │
│ Quadratic / SPD linear systems → CG (+ preconditioning) │
└─────────────────────────────────────────────────────────────┘
Chapter 13: Optimization Algorithms for Equality-Constrained Problems¶
Equality-constrained optimization arises whenever variables must satisfy exact relationships, such as conservation laws, normalization, or linear invariants. In this chapter we focus on problems of the form
\[ \min_x \; f(x) \quad \text{s.t.} \quad A x = b, \]
where \(f : \mathbb{R}^n \to \mathbb{R}\) is typically convex and differentiable, and \(A \in \mathbb{R}^{p \times n}\) has rank \(p\). This linear equality structure appears in constrained least squares, portfolio optimization, and many ML formulations that impose exact balance or normalization constraints.
Geometric View — Optimization on an Affine Manifold¶
The constraint \(A x = b\) defines an affine set
\[ \mathcal{X} = \{\, x \in \mathbb{R}^n : A x = b \,\}. \]
If \(\operatorname{rank}(A) = p\), then \(\mathcal{X}\) is an \((n-p)\)-dimensional affine subspace of \(\mathbb{R}^n\): a “flat” lower-dimensional plane embedded in the ambient space. Optimization now happens along this plane, not in all of \(\mathbb{R}^n\). Any feasible direction \(d\) must keep us in \(\mathcal{X}\), so it must satisfy
\[ A(x + d) = b \quad \Longrightarrow \quad A d = 0. \]
Thus, feasible directions lie in the null space of \(A\):
\[ \operatorname{Null}(A) = \{\, d \in \mathbb{R}^n : A d = 0 \,\}. \]
At an optimal point \(x^\star \in \mathcal{X}\), moving in any feasible direction \(d\) cannot decrease \(f\). For differentiable \(f\), this means
\[ \nabla f(x^\star)^\top d \ge 0 \quad \text{for all } d \in \operatorname{Null}(A); \]
since \(-d\) is also feasible, equality must hold. Equivalently, \(\nabla f(x^\star)\) must be orthogonal to all feasible directions, i.e. it lies in the row space of \(A\). Therefore there exists a vector of Lagrange multipliers \(\nu^\star\) such that
\[ \nabla f(x^\star) + A^\top \nu^\star = 0. \]
This is the basic geometric optimality condition: at the optimum, the gradient of \(f\) is a linear combination of the constraint normals (rows of \(A\)), and every feasible direction is orthogonal to \(\nabla f(x^\star)\).
Lagrange Function and KKT System¶
The Lagrangian for the equality-constrained problem is
\[ L(x,\nu) = f(x) + \nu^\top (A x - b), \]
where \(\nu \in \mathbb{R}^p\) are Lagrange multipliers. The first-order (KKT) conditions for a point \((x^\star,\nu^\star)\) to be optimal are
\[ \nabla f(x^\star) + A^\top \nu^\star = 0, \qquad A x^\star = b. \]
When \(f\) is convex and \(A\) has full row rank, these conditions are necessary and sufficient for global optimality. For Newton-type methods we linearize these conditions around a current iterate \((x,\nu)\) and solve for corrections \((\Delta x,\Delta \nu)\) from
\[ \begin{bmatrix} \nabla^2 f(x) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^\top \nu \\ A x - b \end{bmatrix}. \]
This linear system is called the (equality-constrained) KKT system. At the optimum the right-hand side is zero.
Quadratic Objectives¶
A particularly important case is a convex quadratic objective
\[ f(x) = \tfrac12\, x^\top P x + q^\top x + r, \]
with \(P \succeq 0\). The equality-constrained problem
\[ \min_x \; \tfrac12\, x^\top P x + q^\top x + r \quad \text{subject to} \quad A x = b \]
has KKT conditions
\[ \begin{bmatrix} P & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q \\ b \end{bmatrix}. \]
If \(P \succ 0\) and \(A\) has full row rank, this system has a unique solution \((x^\star,\nu^\star)\). This is the standard linear system solved in equality-constrained least squares and quadratic programming.
Examples in ML and statistics:
- constrained least squares with sum-to-one constraints on coefficients;
- portfolio optimization with \(\mathbf{1}^\top w = 1\);
- quadratic surrogate subproblems inside second-order methods.
The structure of the KKT matrix (symmetric, indefinite, with blocks \(P\), \(A\)) can be exploited by specialized linear solvers and factorizations.
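The quadratic KKT system above can be solved with one dense linear solve. A minimal NumPy sketch (the function name and the tiny test problem are illustrative, not from the text):

```python
import numpy as np

def solve_eq_qp(P, q, A, b):
    """Solve min 0.5 x^T P x + q^T x  s.t.  A x = b  via the KKT system.

    Assembles the symmetric indefinite KKT matrix [[P, A^T], [A, 0]] and
    solves for the primal-dual pair (x, nu) in a single linear solve.
    """
    n, p = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-q, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]
```

For example, minimizing \(x_1^2 + x_2^2\) subject to \(x_1 + x_2 = 1\) yields \(x^\star = (0.5, 0.5)\) with multiplier \(\nu^\star = -1\). In production solvers the same system is attacked with a symmetric indefinite (LDLᵀ) factorization rather than a generic solve.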
Null-Space (Reduced Variable) Method¶
When the constraints are linear and of full row rank, a natural approach is to eliminate them explicitly.
Choose:
- a particular feasible point \(x_0\) satisfying \(A x_0 = b\),
- a matrix \(Z \in \mathbb{R}^{n \times (n-p)}\) whose columns form a basis of the null space of \(A\): \(A Z = 0\).
Then every feasible \(x\) can be written as
\[ x = x_0 + Z y, \qquad y \in \mathbb{R}^{n-p}. \]
Substituting into the objective yields an unconstrained reduced problem in the smaller variable \(y\):
\[ \min_{y \in \mathbb{R}^{n-p}} \; \phi(y) := f(x_0 + Z y). \]
Gradients and Hessians transform as
\[ \nabla \phi(y) = Z^\top \nabla f(x_0 + Z y), \qquad \nabla^2 \phi(y) = Z^\top \nabla^2 f(x_0 + Z y)\, Z. \]
We can now apply any unconstrained method (gradient descent, CG, Newton) to \(\phi(y)\). The corresponding updates in the original space are mapped back via \(x = x_0 + Z y\).
Key points:
- Optimization is restricted to feasible directions \(\operatorname{Null}(A)\) by construction.
- The dimension drops from \(n\) to \(n-p\), which can be advantageous if \(p\) is large.
- The cost is computing and storing a suitable null-space basis \(Z\), which may destroy sparsity and be expensive for large-scale problems.
Null-space methods are attractive when:
- the number of constraints is moderate,
- a good factorization of \(A\) is available,
- and we want an unconstrained algorithm in reduced coordinates.
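For a quadratic objective the reduced problem is itself a small linear solve. A minimal sketch, with \(Z\) taken from the SVD of \(A\) and \(x_0\) from a least-squares solve (names and the test problem are illustrative):

```python
import numpy as np

def null_space_qp(P, q, A, b):
    """Null-space method for min 0.5 x^T P x + q^T x  s.t.  A x = b."""
    # Particular feasible point: least-norm solution of A x0 = b.
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]
    # Null-space basis Z: rows of Vt beyond rank(A), transposed.
    _, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-12))
    Z = Vt[r:].T
    # Reduced problem: stationarity Z^T (P (x0 + Z y) + q) = 0.
    y = np.linalg.solve(Z.T @ P @ Z, -Z.T @ (P @ x0 + q))
    return x0 + Z @ y
```

On the same toy QP as before (minimize \(x_1^2 + x_2^2\) with \(x_1 + x_2 = 1\)), the reduced variable is one-dimensional and the method recovers \((0.5, 0.5)\). The SVD here is the expensive step that can destroy sparsity, as noted above.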
Newton’s Method for Equality-Constrained Problems¶
For a twice-differentiable convex \(f\), we can derive an equality-constrained Newton step by solving a local quadratic approximation subject to linearized constraints.
At a point \(x\), approximate \(f(x+d)\) by its second-order Taylor expansion:
\[ f(x + d) \approx f(x) + \nabla f(x)^\top d + \tfrac12\, d^\top \nabla^2 f(x)\, d. \]
We seek a step \(d\) that approximately minimizes this quadratic model while remaining feasible to first order, i.e.
\[ \min_d \; \nabla f(x)^\top d + \tfrac12\, d^\top \nabla^2 f(x)\, d \quad \text{subject to} \quad A d = 0. \]
The KKT conditions for this quadratic subproblem are
\[ \begin{bmatrix} \nabla^2 f(x) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} d \\ \lambda \end{bmatrix} = \begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}. \]
Solving this system gives the Newton step \(d_{\text{nt}}\) and a multiplier update \(\lambda\). The primal update is
\[ x_{k+1} = x_k + \alpha_k\, d_{\text{nt}}, \]
with a step size \(\alpha_k \in (0,1]\) chosen by line search to ensure sufficient decrease and preservation of feasibility (for equality constraints, \(A d_{\text{nt}} = 0\) guarantees \(A x_{k+1} = b\) whenever \(A x_k = b\)).
Geometrically:
- unconstrained Newton would move by \(-\nabla^2 f(x)^{-1} \nabla f(x)\);
- equality-constrained Newton projects this step onto the tangent space \(\{ d : A d = 0 \}\) of the affine constraint set.
For strictly convex \(f\) with positive definite Hessian on the feasible directions, this method enjoys quadratic convergence near the solution, much like the unconstrained Newton method.
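A minimal sketch of the iteration for a generic smooth convex \(f\), starting from a feasible point so every step satisfies \(A d = 0\) (the exponential test objective and all names are illustrative):

```python
import numpy as np

def eq_newton(grad, hess, A, x, iters=20, tol=1e-10):
    """Equality-constrained Newton steps from a feasible x (A x = b)."""
    p = A.shape[0]
    for _ in range(iters):
        g, H = grad(x), hess(x)
        # KKT system: [[H, A^T], [A, 0]] [d; lam] = [-g; 0]
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        d = np.linalg.solve(K, np.concatenate([-g, np.zeros(p)]))[: x.size]
        if np.linalg.norm(d) < tol:
            break
        x = x + d  # full step; a backtracking line search adds robustness
    return x
```

For example, minimizing \(\sum_i e^{x_i}\) subject to \(\mathbf{1}^\top x = 0\) has the symmetric solution \(x^\star = 0\), which the iteration reaches in a handful of steps from a feasible start.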
Connections to Machine Learning and Signal Processing¶
Linear equality constraints appear naturally in ML and related areas:
| Setting | Equality constraint | Interpretation |
|---|---|---|
| Portfolio optimization | \(\mathbf{1}^\top w = 1\) | Weights sum to one (full investment) |
| Constrained regression | \(C x = d\) | Enforce domain-specific linear relations between coefficients |
| Mixture models / convex combinations | \(\mathbf{1}^\top \alpha = 1, \; \alpha \ge 0\) | Mixture weights form a probability simplex |
| Fairness constraints (linearized) | \(A w = 0\) | Enforce equal averages across groups or balance conditions |
| Physics-informed models (discretized) | \(A x = b\) | Discrete conservation laws (mass, charge, energy) |
More generally, nonlinear equality constraints (e.g. \(W^\top W = I\) for orthonormal embeddings, or \(\|w\|_2^2 = 1\) for normalized weights) lead to optimization on curved manifolds. Techniques from this chapter extend to those settings when combined with Riemannian optimization or local parameterizations, but here we focus on the linear case as the fundamental building block.
Chapter 14: Optimization Algorithms for Inequality-Constrained Problems¶
In many applications, we must optimize an objective while respecting inequality constraints: nonnegativity of variables, margin constraints in SVMs, capacity or safety limits, physical bounds, fairness budgets, and more. Mathematically, the feasible region is now a convex set with a boundary, and the optimizer often lies on that boundary.
This chapter introduces algorithms for solving such problems, focusing on logarithmic barrier and interior-point methods. These are the workhorses behind modern general-purpose convex solvers (for LP, QP, SOCP, SDP) and provide a smooth way to enforce inequalities while still using Newton-type methods.
Problem Setup¶
We consider the general convex problem with inequality and equality constraints
\[ \min_x \; f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \;\; i = 1,\dots,m, \qquad A x = b, \]
where
- \(f_0, f_1,\dots,f_m\) are convex, typically twice differentiable,
- \(A \in \mathbb{R}^{p \times n}\) has full row rank,
- there exists a strictly feasible point \(\bar{x}\) such that \(f_i(\bar{x}) < 0\) for all \(i\) and \(A\bar{x} = b\) (Slater’s condition).
Under these assumptions:
- the problem is convex,
- strong duality holds (zero duality gap),
- and the KKT conditions characterize optimality.
Examples¶
| Problem type | \(f_0(x)\) | Constraints \(f_i(x)\le0\) | ML / applications |
|---|---|---|---|
| Linear program (LP) | \(c^\top x\) | \(a_i^\top x - b_i \le 0\) | resource allocation, feature selection |
| Quadratic program | \(\tfrac12 x^\top P x + q^\top x\) | linear | SVMs, ridge with box constraints |
| QCQP | quadratic | quadratic | portfolio optimization, control |
| Entropy models | \(\sum_i x_i \log x_i\) | \(F x - g \le 0\) | probability calibration, max-entropy |
| Nonnegativity | arbitrary convex | \(-x_i \le 0\) | sparse coding, nonnegative factorization |
Many machine-learning training problems can be written in this template by expressing regularization, margins, fairness, or safety conditions as convex inequalities.
Indicator-Function View of Constraints¶
Conceptually, we can write inequality constraints using an indicator function. Define
\[ I_{-}(u) = \begin{cases} 0, & u \le 0, \\ +\infty, & u > 0. \end{cases} \]
Then the inequality-constrained problem is equivalent to
\[ \min_x \; f_0(x) + \sum_{i=1}^m I_{-}(f_i(x)) \quad \text{subject to} \quad A x = b. \]
- If \(x\) is feasible (\(f_i(x) \le 0\) for all \(i\)), the indicators contribute \(0\).
- If any constraint is violated (\(f_i(x) > 0\)), the objective becomes \(+\infty\).
This formulation is clean but not numerically friendly:
- \(I_{-}\) is discontinuous and nonsmooth.
- We cannot directly apply Newton-type methods.
The key idea of barrier methods is to replace the hard indicator with a smooth approximation that grows to \(+\infty\) as we approach the boundary.
Logarithmic Barrier Approximation¶
We approximate the indicator \(I_{-}\) with a smooth barrier function
\[ \widehat{I}_{-}(u) = -\frac{1}{t}\log(-u), \qquad \operatorname{dom} \widehat{I}_{-} = \{\, u : u < 0 \,\}. \]
For each inequality \(f_i(x) \le 0\), we introduce a barrier term \(-\log(-f_i(x))\). For a given parameter \(t > 0\), we solve the barrier subproblem
\[ \min_x \; f_0(x) + \frac{1}{t} \sum_{i=1}^m -\log(-f_i(x)) \quad \text{subject to} \quad A x = b. \]
Equivalently,
\[ \min_x \; t f_0(x) + \phi(x) \quad \text{subject to} \quad A x = b, \qquad \phi(x) = -\sum_{i=1}^m \log(-f_i(x)). \]
Interpretation:
- The barrier term \(\phi(x)\) is finite only for strictly feasible points (\(f_i(x) < 0\)).
- As \(x\) approaches the boundary \(f_i(x) \to 0^-\), the term \(-\log(-f_i(x)) \to +\infty\).
- The parameter \(t\) controls the trade-off:
- small \(t\) (large \(1/t\)) → strong barrier, solution stays deep inside the feasible set;
- large \(t\) → barrier is weaker, solutions can move closer to the boundary.
As \(t \to \infty\), solutions of the barrier subproblem approach the solution of the original constrained problem.
Derivatives of the Barrier¶
Let
\[ \phi(x) = -\sum_{i=1}^m \log(-f_i(x)). \]
Then \(\phi\) is convex and twice differentiable on its domain. Its gradient and Hessian are
\[ \nabla \phi(x) = \sum_{i=1}^m \frac{1}{-f_i(x)}\, \nabla f_i(x), \qquad \nabla^2 \phi(x) = \sum_{i=1}^m \frac{1}{f_i(x)^2}\, \nabla f_i(x)\, \nabla f_i(x)^\top + \sum_{i=1}^m \frac{1}{-f_i(x)}\, \nabla^2 f_i(x). \]
Key features:
- As \(f_i(x) \uparrow 0\) (approaching the boundary from inside), the factor \(1/(-f_i(x))\) blows up, so \(\|\nabla \phi(x)\|\) becomes very large.
- This creates a strong repulsive force that prevents iterates from crossing the boundary.
- The barrier “pushes” the solution away from constraint violation, while the original objective \(f_0(x)\) pulls toward lower cost.
The barrier subproblem is a smooth equality-constrained problem. We can therefore apply equality-constrained Newton methods (Chapter 13) at each fixed \(t\).
Central Path and Approximate KKT Conditions¶
For each \(t > 0\), let \(x^\star(t)\) be a minimizer of the barrier problem
\[ \min_x \; t f_0(x) + \phi(x) \quad \text{subject to} \quad A x = b. \]
The set \(\{x^\star(t) : t > 0\}\) is called the central path. As \(t \to \infty\), \(x^\star(t)\) converges to a solution \(x^\star\) of the original inequality-constrained problem.
We can associate approximate dual variables to \(x^\star(t)\):
\[ \lambda_i^\star(t) = \frac{1}{-t\, f_i(x^\star(t))} > 0, \qquad i = 1,\dots,m, \]
with \(v^\star(t)\) obtained (scaled by \(1/t\)) from the equality-constraint multiplier of the centering problem. Then the KKT-like relations hold:
\[ \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t)\, \nabla f_i(x^\star(t)) + A^\top v^\star(t) = 0, \qquad A x^\star(t) = b, \qquad f_i(x^\star(t)) < 0. \]
Compare with the exact KKT conditions (for optimal \((x^\star,\lambda^\star,v^\star)\)):
\[ \nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + A^\top v^\star = 0, \qquad A x^\star = b, \qquad \lambda_i^\star \ge 0, \qquad f_i(x^\star) \le 0, \qquad \lambda_i^\star f_i(x^\star) = 0. \]
Along the central path we have the relaxed complementarity condition
\[ -\lambda_i^\star(t)\, f_i(x^\star(t)) = \frac{1}{t}, \]
which tends to \(0\) as \(t \to \infty\). Hence the barrier formulation naturally yields approximate primal–dual solutions whose KKT residuals shrink as we increase \(t\).
Geometric and Physical Intuition¶
Consider the barrier-augmented objective
\[ t f_0(x) + \sum_{i=1}^m -\log(-f_i(x)). \]
We can interpret this as:
- \(t f_0(x)\): an “external potential” pulling us toward low objective values.
- \(-\log(-f_i(x))\): repulsive potentials that become infinite near the boundary \(f_i(x)=0\).
At a minimizer \(x^\star(t)\) (ignoring equality constraints for intuition), we have
\[ t\, \nabla f_0(x^\star(t)) + \sum_{i=1}^m \frac{1}{-f_i(x^\star(t))}\, \nabla f_i(x^\star(t)) = 0. \]
The gradient of the objective is exactly balanced by a weighted sum of constraint gradients. This is a force-balance condition:
- constraints “push back” more strongly when \(x\) is close to their boundary,
- the interior-point iterates follow a smooth path that stays strictly feasible and moves gradually toward the optimal boundary point.
This picture explains both:
- why iterates never leave the feasible region, and
- why the method naturally generates dual variables (the weights on constraint gradients).
The Barrier Method¶
The barrier method solves the original inequality-constrained problem by solving a sequence of barrier subproblems with increasing \(t\).
Algorithm: Barrier Method (Conceptual Form)¶
Given:
- a strictly feasible starting point \(x\) (\(f_i(x) < 0\), \(A x = b\)),
- initial barrier parameter \(t > 0\),
- barrier growth factor \(\mu > 1\) (e.g. \(\mu \in [10,20]\)),
- accuracy tolerance \(\varepsilon > 0\),
repeat:
- Centering step. Solve the equality-constrained problem
  \[ x^\star(t) = \arg\min_x \; t f_0(x) + \phi(x) \quad \text{subject to} \quad A x = b \]
  using an equality-constrained Newton method. (In practice, we warm-start from the previous solution and take a small number of Newton steps rather than solving exactly.)
- Update iterate. Let \(x\) be the resulting point (the approximate minimizer for the current \(t\)).
- Check stopping criterion. For the barrier problem, one can show
  \[ f_0(x^\star(t)) - p^\star \le \frac{m}{t}, \]
  where \(p^\star\) is the optimal value of the original problem. If \(m/t \le \varepsilon\), stop: \(x\) is guaranteed to be within \(\varepsilon\) (in objective value) of optimal.
- Increase \(t\). Set \(t := \mu t\) to weaken the barrier and move closer to the true boundary, then return to the centering step.
Key parameters:
| Symbol | Role |
|---|---|
| \(t\) | barrier strength (larger \(t\) = weaker barrier, closer to solution) |
| \(\mu\) | growth factor for \(t\) |
| \(\varepsilon\) | desired accuracy (duality-gap based) |
| \(m\) | number of inequality constraints |
In practice:
- \(\varepsilon\) is often in the range \(10^{-3}\)–\(10^{-8}\),
- \(\mu\) is chosen to balance outer iterations vs inner Newton steps,
- the centering step is usually solved to modest accuracy, not exactness.
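The loop above can be sketched for an inequality-only LP, \(\min c^\top x\) s.t. \(G x \le h\); with no equality block, each centering step is plain damped Newton on \(t\,c^\top x - \sum_i \log(h_i - g_i^\top x)\). All names, tolerances, and the toy instance are illustrative:

```python
import numpy as np

def barrier_lp(c, G, h, x, t=1.0, mu=10.0, eps=1e-6):
    """Barrier method for  min c^T x  s.t.  G x <= h  (x strictly feasible)."""
    m = G.shape[0]
    while m / t > eps:                              # duality-gap bound m/t
        for _ in range(50):                         # centering: damped Newton
            s = h - G @ x                           # slacks, must stay positive
            grad = t * c + G.T @ (1.0 / s)
            H = (G / (s**2)[:, None]).T @ G         # sum_i g_i g_i^T / s_i^2
            dx = np.linalg.solve(H, -grad)
            if grad @ dx > -1e-12:                  # Newton decrement ~ 0
                break
            alpha = 1.0                             # backtrack: stay feasible
            while np.any(h - G @ (x + alpha * dx) <= 0):
                alpha *= 0.5
            x = x + alpha * dx
        t *= mu
    return x
```

For instance, minimizing \(x\) over \(0 \le x \le 1\) from \(x = 0.5\) drives the iterate toward the boundary point \(x^\star = 0\), with \(x^\star(t) \approx 1/t\) tracing the central path.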
From Barrier Methods to Interior-Point Methods¶
Pure barrier methods conceptually “solve a sequence of problems for increasing \(t\)”. Modern interior-point methods refine this idea:
- they update both primal variables \(x\) and dual variables \((\lambda, v)\),
- they use Newton’s method on the (perturbed) KKT system,
- they follow the central path by simultaneously enforcing:
- primal feasibility (\(f_i(x) \le 0\), \(A x = b\)),
- dual feasibility (\(\lambda_i \ge 0\)),
- relaxed complementarity (\(-\lambda_i f_i(x) \approx 1/t\)).
A typical primal–dual step solves a linearized KKT system of the form
\[ \begin{bmatrix} \nabla^2 f_0(x) + \sum_i \lambda_i \nabla^2 f_i(x) & Df(x)^\top & A^\top \\ -\operatorname{diag}(\lambda)\, Df(x) & -\operatorname{diag}(f(x)) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \\ \Delta v \end{bmatrix} = - \begin{bmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{pri}} \end{bmatrix}, \]
where \(Df(x)\) stacks the rows \(\nabla f_i(x)^\top\), and the residuals are \(r_{\text{dual}} = \nabla f_0(x) + Df(x)^\top \lambda + A^\top v\), \(r_{\text{cent}} = -\operatorname{diag}(\lambda) f(x) - (1/t)\mathbf{1}\), and \(r_{\text{pri}} = A x - b\).
Newton’s method applied to these equations yields search directions for \((x,\lambda,v)\) that move toward the central path and reduce primal and dual residuals simultaneously. This is what modern LP/QP/SOCP/SDP solvers implement.
You do not need to implement these methods from scratch to use them: in practice, you describe your problem in a modeling language (e.g. CVX, CVXPY, JuMP) and rely on an interior-point solver under the hood.
Computational and Practical Notes¶
Some important practical aspects:
- Equality-constrained Newton inside. Each barrier subproblem is solved by equality-constrained Newton (Chapter 13). The main cost is solving the KKT linear system at each Newton step.
- Strict feasibility. Barrier and interior-point methods require a strictly feasible starting point \(x\) with \(f_i(x) < 0\). Sometimes this is easy (e.g. nonnegativity constraints with a positive initial vector); otherwise, a separate phase I problem is solved to find such a point or to certify infeasibility.
- Step size control. Because the barrier blows up near the boundary, overly aggressive Newton steps may try to leave the feasible region. A backtracking line search is used to ensure sufficient decrease in the barrier objective and preservation of strict feasibility (\(f_i(x) < 0\) remains true).
- Accuracy vs cost. The duality-gap bound \(m/t\) provides a clear trade-off: small \(m/t\) (large \(t\)) → high accuracy, more iterations; larger \(m/t\) → faster but less precise.
- Sparsity and structure. For large problems, exploiting sparsity in \(A\) and in the Hessians \(\nabla^2 f_i(x)\) is crucial. Interior-point methods scale well when the linear algebra is carefully optimized.
Equality vs Inequality-Constrained Algorithms¶
Finally, it is helpful to contrast the equality-only case (Chapter 13) with the inequality case.
| Aspect | Equality constraints \(A x = b\) | Inequality constraints \(f_i(x) \le 0\) |
|---|---|---|
| Feasible set | Affine subspace | General convex region with boundary |
| Typical algorithms | Lagrange/KKT, equality-constrained Newton, null-space | Barrier methods, primal–dual interior-point methods |
| Feasibility during iteration | Can start infeasible and converge to \(A x = b\) | Iterates kept strictly feasible (\(f_i(x) < 0\)) |
| Complementarity | Not present (only equalities) | \(\lambda_i f_i(x) = 0\) at optimum, or \(\approx -1/t\) along central path |
| Geometric picture | Optimization on a flat manifold | Optimization in a convex region, repelled from boundary |
| ML relevance | Normalization, linear invariants, balance constraints | Nonnegativity, margin constraints, safety/fairness limits |
In summary:
- Equality-constrained methods operate directly on an affine manifold using KKT and Newton.
- Inequality-constrained methods use smooth barriers (or primal–dual perturbed KKT systems) to stay in the interior and gradually approach the boundary and the optimal point.
Interior-point methods unify these perspectives and are the backbone of modern convex optimization software.
Chapter 15: Advanced Large-Scale and Structured Methods¶
Modern convex optimization often runs at massive scale: millions (or billions) of variables, datasets too large to fit in memory, and constraints spread across machines or devices. Per-iteration cost and memory usage often make classical methods impractical in these regimes.
This chapter introduces methods that exploit structure, sparsity, separability, and stochasticity to make convex optimization scalable. These ideas underpin the optimization engines behind most modern machine learning systems.
Motivation: Structure and Scale¶
In large-scale convex optimization, the challenge is not “does a solution exist?” but rather “can we compute it in time and memory?”.
Bottlenecks include:
- Memory: storing Hessians (or even full gradients) may be impossible.
- Data size: one full pass over all samples can already be expensive.
- Distributed data: samples are spread across devices / workers.
- Sparsity and separability: the problem often decomposes into many small pieces.
A common template is the empirical risk + regularizer form
\[ \min_x \; f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x) + R(x), \]
where
- each \(f_i(x)\) is a loss term for sample \(i\),
- \(R(x)\) is a regularizer (possibly nonsmooth, e.g. \(\lambda\|x\|_1\)).
The methods in this chapter are designed to exploit this structure:
- update only parts of \(x\) at a time (coordinate/block methods),
- use only some data per step (stochastic methods),
- split the problem into simpler subproblems (proximal / ADMM),
- or distribute computation across multiple machines (consensus methods).
Coordinate Descent¶
Coordinate descent updates one coordinate (or a small block of coordinates) at a time, holding all others fixed. It is especially effective when updates along a single coordinate are cheap to compute. Given \(x^{(k)}\), choose coordinate \(i_k\) and define
\[ x^{(k+1)}_{i_k} = \arg\min_{\xi \in \mathbb{R}} \; f\big(x^{(k)}_1, \dots, x^{(k)}_{i_k - 1},\, \xi,\, x^{(k)}_{i_k + 1}, \dots, x^{(k)}_n\big), \qquad x^{(k+1)}_j = x^{(k)}_j \;\; (j \ne i_k). \]
In practice:
- \(i_k\) is chosen either cyclically (\(1,2,\dots,n,1,2,\dots\)) or randomly.
- Each coordinate update often has a closed form (e.g. soft-thresholding for LASSO).
- You never form or store the full gradient; you only need partial derivatives.
Why it scales:
- Each step is very cheap — often \(O(1)\) or proportional to the number of nonzeros in the column corresponding to coordinate \(i_k\).
- In high dimensions (e.g., millions of features), this can be far more efficient than updating all coordinates at once.
Convergence:
If \(f\) is convex and has Lipschitz-continuous partial derivatives, coordinate descent (cyclic or randomized) converges to the global minimizer. Randomized coordinate descent often has clean expected convergence rates.
ML context:
- LASSO / Elastic Net regression (coordinate updates are soft-thresholding),
- \(\ell_1\)-penalized logistic regression,
- matrix factorization and dictionary learning (updating one factor vector at a time),
- problems where \(R(x)\) is separable: \(R(x) = \sum_i R_i(x_i)\).
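A minimal LASSO coordinate-descent sketch, where each update is a closed-form soft-threshold and the residual is maintained incrementally (function names are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_cd(A, b, lam, iters=100):
    """Cyclic coordinate descent for min 0.5 ||Ax - b||^2 + lam ||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = np.sum(A**2, axis=0)          # ||A_j||^2 for each column
    r = b - A @ x                          # residual, updated incrementally
    for _ in range(iters):
        for j in range(n):
            r = r + A[:, j] * x[j]         # remove coordinate j's contribution
            x[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
            r = r - A[:, j] * x[j]         # restore with the new value
    return x
```

With \(A = I\), the solution is the soft-threshold of \(b\) itself, e.g. \(b = (3, -0.5, 0)\) and \(\lambda = 1\) gives \(x = (2, 0, 0)\); entries smaller than \(\lambda\) are set exactly to zero.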
Stochastic Gradient and Variance-Reduced Methods¶
When \(N\) (number of samples) is huge, computing the full gradient every iteration is too expensive. Stochastic methods replace this full gradient with cheap, unbiased estimates based on small random subsets (mini-batches) of data.
Stochastic Gradient Descent (SGD)¶
At iteration \(k\):
- Sample a mini-batch \(\mathcal{B}_k \subset \{1,\dots,N\}\).
- Form the stochastic gradient
\[ \widehat{\nabla f}(x_k) = \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \nabla f_i(x_k). \]
- Update
\[ x_{k+1} = x_k - \eta_k\, \widehat{\nabla f}(x_k), \]
where \(\eta_k > 0\) is the learning rate.
Properties:
- \(\mathbb{E}[\widehat{\nabla f}(x_k) \mid x_k] = \nabla f(x_k)\) (unbiased),
- Each iteration is cheap (depends only on \(|\mathcal{B}_k|\), not \(N\)),
- The noise can help escape shallow nonconvex traps (in deep learning).
In convex settings, SGD trades off per-iteration cost against convergence speed: many cheap noisy steps instead of fewer expensive precise ones.
Step Sizes and Averaging¶
The step size \(\eta_k\) is crucial:
- Too large → iterates diverge or oscillate.
- Too small → extremely slow progress.
Typical schedules for convex problems:
- General convex: \(\eta_k = \frac{c}{\sqrt{k}}\),
- Strongly convex: \(\eta_k = \frac{c}{k}\).
Two important stabilization techniques:
- Decay the learning rate over time.
- Polyak–Ruppert averaging: return the average \(\bar{x}_k = \frac{1}{k}\sum_{j=1}^{k} x_j\) instead of the last iterate. Averaging cancels noise and leads to optimal \(O(1/k)\) rates in strongly convex settings.
Mini-batch size can also grow with \(k\), gradually reducing variance while keeping early iterations cheap.
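A minimal sketch combining a decaying step size with Polyak–Ruppert averaging on least squares (the \(+10\) offset in the schedule is an illustrative stabilizer for the first few steps, not from the text; all names are illustrative):

```python
import numpy as np

def sgd_least_squares(A, b, passes=50, c=1.0, seed=0):
    """SGD with eta_k = c/(k+10) and Polyak-Ruppert iterate averaging
    on the least-squares risk 0.5/N * ||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    N, n = A.shape
    x = np.zeros(n)
    x_avg = np.zeros(n)
    k = 0
    for _ in range(passes):
        for i in rng.permutation(N):       # one pass = one shuffled epoch
            k += 1
            g = (A[i] @ x - b[i]) * A[i]   # gradient of sample i's loss
            x = x - (c / (k + 10)) * g     # decaying step size
            x_avg += (x - x_avg) / k       # running average of iterates
    return x, x_avg
```

On noiseless data the last iterate converges quickly; on noisy data the averaged iterate is typically the more stable estimate, matching the discussion above.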
Convergence Rates¶
For convex \(f\):
- With appropriate diminishing \(\eta_k\),
\(\mathbb{E}[f(x_k)] - f^\star = O(k^{-1/2})\).
For strongly convex \(f\):
- With \(\eta_k = O(1/k)\) and averaging,
\(\mathbb{E}[\|x_k - x^\star\|^2] = O(1/k)\).
These rates are optimal for unbiased first-order stochastic methods.
Variance-Reduced Methods (SVRG, SAGA, SARAH)¶
Plain SGD cannot easily reach very high accuracy because the gradient noise never fully disappears. Variance-reduced methods reduce this noise, especially near the solution, by periodically using the full gradient.
Example: SVRG (Stochastic Variance-Reduced Gradient)
- Pick a reference point \(\tilde{x}\) and compute \(\nabla f(\tilde{x})\).
- For inner iterations:
\[ x_{k+1} = x_k - \eta\, v_k, \qquad v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x}), \]
where \(i_k\) is a random sample index.
Here \(v_k\) is still an unbiased estimator of \(\nabla f(x_k)\), but its variance decays as \(x_k\) approaches \(\tilde{x}\). For strongly convex \(f\), methods like SVRG and SAGA achieve linear convergence, comparable to full gradient descent but at near-SGD cost.
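A minimal SVRG sketch for least squares (the step size, epoch counts, and names are illustrative choices, not from the text):

```python
import numpy as np

def svrg_least_squares(A, b, eta=0.02, epochs=30, seed=0):
    """SVRG on 0.5/N ||Ax - b||^2: a periodic full gradient at a snapshot
    plus variance-corrected stochastic steps in between."""
    rng = np.random.default_rng(seed)
    N, n = A.shape
    x = np.zeros(n)
    for _ in range(epochs):
        x_snap = x.copy()
        full_g = A.T @ (A @ x_snap - b) / N          # full gradient at snapshot
        for _ in range(2 * N):                       # inner loop of length 2N
            i = rng.integers(N)
            gi = (A[i] @ x - b[i]) * A[i]            # sample grad, current point
            gi_s = (A[i] @ x_snap - b[i]) * A[i]     # sample grad, snapshot
            x = x - eta * (gi - gi_s + full_g)       # variance-reduced step
    return x
```

Unlike plain SGD with the same constant step size, the correction term lets the iterates converge linearly to high accuracy on this strongly convex problem.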
Momentum and Adaptive Methods¶
In practice, large-scale learning often uses SGD with various modifications:
-
Momentum / Nesterov: keep a moving average of gradients which accelerates progress along consistent directions and damps oscillations.
-
Adaptive methods (Adagrad, RMSProp, Adam): maintain coordinate-wise scales based on past squared gradients, effectively using a diagonal preconditioner that adapts to curvature and feature scales.
These methods are especially popular in deep learning. For convex problems, their theoretical behavior is subtle, but empirically they often converge faster in wall-clock time.
Proximal and Composite Optimization¶
Many large-scale convex problems are composite:
\[ \min_x \; f(x) = g(x) + R(x), \]
where
- \(g\) is smooth convex with Lipschitz gradient (e.g. data-fitting term),
- \(R\) is convex but possibly nonsmooth (e.g. \(\lambda\|x\|_1\), indicator of a constraint set, nuclear norm).
The proximal gradient method (a.k.a. ISTA) updates as
\[ x_{k+1} = \operatorname{prox}_{\eta R}\!\big(x_k - \eta\, \nabla g(x_k)\big), \]
where the proximal operator is
\[ \operatorname{prox}_{\eta R}(y) = \arg\min_x \; R(x) + \frac{1}{2\eta}\|x - y\|_2^2. \]
Intuition:
- The gradient step moves \(x\) in a direction that lowers the smooth term \(g\).
- The prox step solves a small “regularized” problem, pulling \(x\) toward a structure favored by \(R\) (sparsity, low rank, feasibility, etc.).
Examples of prox operators:
- \(R(x) = \lambda\|x\|_1\) → soft-thresholding (coordinate-wise shrinkage).
- \(R\) = indicator of a convex set \(\mathcal{X}\) → projection onto \(\mathcal{X}\) (so projected gradient is a special case).
- \(R(X) = \|X\|_*\) (nuclear norm) → singular value soft-thresholding.
For large-scale problems:
- Proximal gradient scales like gradient descent: each iteration uses only \(\nabla g\) and a prox (often cheap and parallelizable).
- Accelerated variants (FISTA) achieve \(O(1/k^2)\) rates for smooth \(g\).
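The forward (gradient) and backward (prox) steps can be sketched for the \(\ell_1\) case, where the prox is soft-thresholding (function name is illustrative):

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for min 0.5 ||Ax - b||^2 + lam ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))        # forward (gradient) step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox step
    return x
```

Swapping the shrinkage line for a projection recovers projected gradient, and adding the FISTA momentum sequence to the same two steps yields the accelerated \(O(1/k^2)\) variant.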
Alternating Direction Method of Multipliers (ADMM)¶
When objectives naturally split into simpler pieces depending on different variables, ADMM is a powerful tool. It is especially useful when:
- \(f\) and \(g\) have simple prox operators,
- the problem is distributed or separable across machines.
Consider
\[ \min_{x,z} \; f(x) + g(z) \quad \text{subject to} \quad A x + B z = c. \]
The augmented Lagrangian is
\[ L_\rho(x,z,y) = f(x) + g(z) + y^\top (A x + B z - c) + \tfrac{\rho}{2}\|A x + B z - c\|_2^2, \]
with dual variable \(y\) and penalty parameter \(\rho > 0\).
ADMM performs the iterations:
\[ x^{k+1} = \arg\min_x \; L_\rho(x, z^k, y^k), \qquad z^{k+1} = \arg\min_z \; L_\rho(x^{k+1}, z, y^k), \qquad y^{k+1} = y^k + \rho\,\big(A x^{k+1} + B z^{k+1} - c\big). \]
Interpretation:
- The \(x\)-update solves a subproblem involving \(f\) only.
- The \(z\)-update solves a subproblem involving \(g\) only.
- The \(y\)-update nudges the constraint \(A x + B z = c\) toward satisfaction.
For convex \(f,g\), ADMM converges to a primal–dual optimal point. It is particularly effective when the \(x\)- and \(z\)-subproblems have closed-form prox solutions or can be solved cheaply in parallel.
ML use cases:
- Distributed LASSO / logistic regression,
- matrix completion and robust PCA,
- consensus optimization (each worker has local data but shares a global model),
- some federated learning formulations.
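For the LASSO split \(f(x) = \tfrac12\|Ax - b\|^2\), \(g(z) = \lambda\|z\|_1\) with constraint \(x - z = 0\), all three ADMM updates have closed forms. A minimal sketch in scaled dual form, \(u = y/\rho\) (names are illustrative):

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=300):
    """ADMM for min 0.5 ||Ax - b||^2 + lam ||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    M = A.T @ A + rho * np.eye(n)   # x-update system (factor once in practice)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + rho * (z - u))                      # x-step
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # prox
        u = u + x - z                                                    # dual
    return z
```

The \(x\)-step is a ridge-like solve, the \(z\)-step is soft-thresholding, and the dual step accumulates the constraint residual, exactly mirroring the three updates above.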
Majorization–Minimization (MM) and EM Algorithms¶
The Majorization–Minimization (MM) principle constructs at each iterate \(x_k\) a surrogate function \(g(\cdot \mid x_k)\) that upper-bounds \(f\) and is easier to minimize.
Requirements:
\[ g(x \mid x_k) \ge f(x) \;\; \text{for all } x, \qquad g(x_k \mid x_k) = f(x_k). \]
Then define
\[ x_{k+1} = \arg\min_x \; g(x \mid x_k). \]
This guarantees monotone decrease:
\[ f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k). \]
The famous Expectation–Maximization (EM) algorithm is an MM method for latent-variable models, where the surrogate arises from Jensen’s inequality and missing-data structure.
Other examples:
- Iteratively Reweighted Least Squares (IRLS) for logistic regression and robust regression,
- MM surrogates for nonconvex penalties (e.g. smoothly approximating \(\ell_0\)),
- mixture models and variational inference.
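As a concrete MM instance, IRLS for least absolute deviations majorizes each \(|r|\) by the quadratic \(r^2/(2|r_k|) + |r_k|/2\), which is tight at the current residual \(r_k\); minimizing the surrogate is a weighted least-squares solve. A minimal sketch (names, the residual floor `eps`, and the toy test are illustrative):

```python
import numpy as np

def irls_lad(A, b, iters=50, eps=1e-6):
    """IRLS (an MM method) for min ||Ax - b||_1."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # least-squares warm start
    for _ in range(iters):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), eps)       # weights from residuals
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, A.T @ (w * b))  # weighted least squares
    return x
```

Fitting a constant to data with one huge outlier illustrates the robustness: the \(\ell_1\) solution is the median, not the mean, so the outlier barely moves the fit.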
Distributed and Parallel Optimization¶
When data or variables are split across machines, we need distributed or parallel optimization schemes.
Synchronous vs Asynchronous¶
- Synchronous methods: all workers compute local gradients or updates, then synchronize (e.g. parameter server, federated averaging).
- Asynchronous methods: workers update parameters without global synchronization, improving hardware utilization but introducing staleness and variance.
Consensus Optimization¶
A standard pattern is consensus form:
\[ \min_{x_1,\dots,x_M,\, z} \; \sum_{i=1}^M f_i(x_i) \quad \text{subject to} \quad x_i = z, \;\; i = 1,\dots,M, \]
where \(f_i\) is the local objective on worker \(i\) and \(z\) is the global consensus variable.
ADMM applied to this problem:
- Each worker updates its local \(x_i\) using only local data,
- The global variable \(z\) is updated by averaging or aggregation,
- Dual variables enforce agreement \(x_i \approx z\).
This template underlies many federated learning and parameter-server architectures.
ML Context¶
- Federated learning (phone/edge devices update local models and send summaries to a server),
- Large-scale convex optimization over sharded datasets,
- Distributed sparse regression, matrix factorization, and graphical model learning.
Handling Structure: Sparsity and Low Rank¶
Large-scale convex problems often have additional structure that we can exploit algorithmically:
| Structure | Typical Regularizer / Constraint | Algorithmic Benefit |
|---|---|---|
| Sparsity | \(\ell_1\), group lasso | Cheap coordinate updates, soft-thresholding |
| Low rank | Nuclear norm \(\|X\|_*\) | SVD-based prox; rank truncation |
| Block separability | \(\sum_i f_i(x_i)\) | Parallel or distributed block updates |
| Graph structure | Total variation on graphs | Local neighborhood computations |
| Probability simplex | simplex constraint or entropy term | Mirror descent, simplex projections |
Examples:
- In compressed sensing, \(\ell_1\) regularization + sparse sensing matrices → very cheap mat–vecs + prox operations.
- In matrix completion, nuclear norm structure + low-rank iterates → approximate SVD instead of full SVD.
- In TV denoising, local difference structure → each prox step involves only neighboring pixels/vertices.
Exploiting structure can yield orders-of-magnitude speedups compared to generic solvers.
Summary and Practical Guidance¶
Different large-scale methods are appropriate in different regimes:
| Method | Gradient Access | Scalability | Parallelization | Convexity Needed | Typical Uses |
|---|---|---|---|---|---|
| Coordinate Descent | Partial gradients | High | Easy (blockwise) | Convex | LASSO, sparse GLMs, matrix factorization |
| SGD / Mini-batch SGD | Stochastic gradients | Excellent | Natural (data parallel) | Convex / nonconvex | Deep learning, logistic regression |
| SVRG / SAGA (VR methods) | Stochastic + periodic full gradient | High | Data parallel | Convex, often strongly convex | Large-scale convex ML, GLMs |
| Proximal Gradient (ISTA/FISTA) | Full gradient + prox | Moderate–High | Easy | Convex | Composite objectives with structure |
| ADMM | Local subproblems | High | Designed for distributed | Convex | Consensus, distributed convex solvers |
| MM / EM | Surrogates | Moderate | Model-specific | Convex / nonconvex | Latent-variable models, IRLS |
| Distributed / Federated | Local gradients | Very high | Essential | Often convex / smooth | Federated learning, multi-agent systems |
Chapter 16: Modelling Patterns and Algorithm Selection¶
Real-world modelling starts not with algorithms but with data, assumptions, and design goals. We choose a loss function from statistical assumptions (e.g. noise model, likelihood) and a complexity penalty or constraints from design preferences (simplicity, robustness, etc.). The resulting convex (or nonconvex) optimization problem often tells us which solver class to use. In practice, solving machine learning problems looks like: modeling → recognize structure → pick solver. Familiar ML models (linear regression, logistic regression, etc.) can be viewed as convex programs. Below we survey common patterns (convex and some nonconvex) and the recommended algorithms/tricks for each.
Regularized estimation and the accuracy–simplicity tradeoff¶
Many learning tasks use a regularized risk minimization form:
\[ \min_x \; \underbrace{L(x)}_{\text{loss}} \; + \; \lambda\, \underbrace{R(x)}_{\text{penalty}}. \]
Here the loss measures fit to data (often from a likelihood) and the penalty (regularizer) enforces simplicity or structure. Increasing \(\lambda\) trades accuracy for simplicity (e.g. model sparsity or shrinkage).
- Ridge regression (ℓ₂ penalty): \(\min_x \; \|A x - b\|_2^2 + \lambda \|x\|_2^2\).
  This arises from Gaussian noise (squared-error loss) plus a quadratic prior on \(x\). It is a smooth, strongly convex quadratic problem (Hessian \(A^TA + \lambda I \succ 0\)). One can solve it via Newton’s method or closed-form normal equations, or for large problems via (accelerated) gradient descent or conjugate gradient. Strong convexity means fast, reliable convergence with second-order or accelerated first-order methods.
- LASSO / Sparse regression (ℓ₁ penalty): \(\min_x \; \tfrac12\|A x - b\|_2^2 + \lambda \|x\|_1\).
  The \(\ell_1\) penalty encourages many \(x_i=0\) (sparsity) for interpretability. The problem is convex but nonsmooth (since \(|\cdot|\) is nondifferentiable at 0). A standard solver is proximal gradient: take a gradient step on the smooth squared loss, then apply the proximal (soft-thresholding) step for \(\ell_1\), which sets small entries to zero. Coordinate descent is another popular solver, updating one coordinate at a time with a closed-form soft-thresholding step. Proximal methods and coordinate descent scale to very high dimensions.
- Elastic net (mixed ℓ₁+ℓ₂): \(\min_x \; \tfrac12\|A x - b\|_2^2 + \lambda_1 \|x\|_1 + \lambda_2 \|x\|_2^2\).
  This combines the sparsity of LASSO with the stability of ridge regression. It is still convex and (for \(\lambda_2>0\)) strongly convex[^4]. One can still use proximal gradient (the prox operator splits into a soft-threshold and a shrink) or coordinate descent. The ℓ₂ term makes the objective strongly convex, so the solution is unique.
- Group lasso, nuclear norm, etc.: Similar composite objectives arise when enforcing block-sparsity or low-rank structure. Each adds a convex penalty (block \(\ell_{2,1}\) norms, nuclear norm) to the loss. These remain convex, often separable or prox-friendly. Proximal methods (using known proximal maps for each norm) or ADMM can handle these.
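For the ridge case, the closed-form normal equations mentioned above are a single line of linear algebra. A sketch (the helper name is illustrative):

```python
import numpy as np

def ridge_closed_form(A, b, lam):
    """Ridge regression via the normal equations: (A^T A + lam I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```

With \(A = I\), the solution is simply \(b/(1+\lambda)\), making the shrinkage effect of \(\lambda\) explicit; for large sparse \(A\), conjugate gradient on the same normal equations replaces the direct solve.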
Algorithms Summary:
- Smooth + ℓ₂ (strongly convex): Newton / quasi-Newton, conjugate gradient, or accelerated gradient.
- Smooth + ℓ₁ (and variants): proximal gradient or coordinate descent; for huge data, stochastic/proximal variants.
- Mixed penalties (ℓ₁ + ℓ₂): treat as composite smooth + nonsmooth; prox and coordinate methods still apply.
- Large \(N\) or \(n\): mini-batch / stochastic gradients (SGD, SVRG, etc.) on the smooth part + prox for the regulariser.
Regularisation strength \(\lambda\) is usually chosen via cross-validation or a validation set, exploring the accuracy–simplicity trade-off.
Robust regression and outlier resistance¶
Standard least-squares uses squared loss, which penalizes large errors quadratically. This makes it sensitive to outliers. Robust alternatives replace the loss:
Least absolute deviations (ℓ₁ loss)¶
Formulation:
\[ \min_x \;\; \|Ax-b\|_1 \;=\; \min_x \sum_{i} \big|a_i^\top x - b_i\big| \]
Interpretation:
- This corresponds to assuming Laplace (double-exponential) noise on the residuals.
- Unlike squared error, it penalizes big residuals linearly, not quadratically, so outliers hurt less.
Geometry/structure:
- The objective is convex but nondifferentiable at zero residual (the kink in \(|r|\) at \(r=0\)).
How to solve it:
- As a linear program (LP). Introduce slack variables \(t_i \ge 0\) and rewrite with constraints \(-t_i \le a_i^\top x - b_i \le t_i\) and objective \(\min \sum_i t_i\). This is now a standard LP, which you can solve with an interior-point LP solver or the simplex method. These methods give high-accuracy solutions and certificates.
- First-order methods for large scale.
For very large problems (millions of samples/features), you can apply:
- subgradient methods,
- proximal methods (using the prox of \(|\cdot|\)).
These are slower in theory (subgradient is only \(O(1/\sqrt{t})\) convergence), but they scale to huge data where generic LP solvers would struggle.
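A small illustration of the slack-variable LP reformulation described above, using `scipy.optimize.linprog`; the data, sizes, and injected outliers are assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 3, 40
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.standard_normal(m)
b[::10] += 5.0                     # inject a few large outliers

# Variables z = (x, t); minimize sum(t)  s.t.  -t <= Ax - b <= t
c = np.concatenate([np.zeros(n), np.ones(m)])
I = np.eye(m)
A_ub = np.block([[A, -I], [-A, -I]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * n + [(0, None)] * m   # x free, t >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_lad = res.x[:n]
print("LAD estimate:", np.round(x_lad, 2))
```

Because the ℓ₁ loss penalizes the outliers only linearly, the recovered coefficients stay close to the ground truth despite the corrupted entries.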
Huber loss¶
Definition of the Huber penalty for residual \(r\):
\[ \rho_\delta(r) \;=\; \begin{cases} \tfrac12 r^2, & |r| \le \delta,\\ \delta |r| - \tfrac12 \delta^2, & |r| > \delta. \end{cases} \]
Huber regression solves:
\[ \min_x \;\; \sum_{i} \rho_\delta\!\big(a_i^\top x - b_i\big) \]
Interpretation:
- For small residuals (\(|r|\le\delta\)): it acts like least-squares (\(\tfrac{1}{2}r^2\)). So inliers are fit tightly.
- For large residuals (\(|r|>\delta\)): it acts like \(\ell_1\) (linear penalty), so outliers get down-weighted.
- Intuition: “be aggressive on normal data, be forgiving on outliers.”
Properties:
- \(\rho_\delta\) is convex.
- It is once continuously differentiable: its gradient exists everywhere, and the only nonsmoothness is a jump in the second derivative at \(|r|=\delta\).
How to solve it:
- Iteratively Reweighted Least Squares (IRLS) / quasi-Newton. Because the loss is essentially quadratic near the solution, Newton-type methods (including IRLS) work well and converge fast on moderate-size problems.
- Proximal / first-order methods. You can apply proximal gradient methods, since each term is simple and has a known prox.
- As a conic program (SOCP). The Huber objective can be written with auxiliary variables and second-order cone constraints, so you can feed it to an SOCP solver and let an interior-point method handle it efficiently and robustly. This is attractive when you want high accuracy and dual certificates.
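A sketch of the IRLS approach for the Huber loss, on assumed synthetic data with gross outliers. The weights implement the standard Huber reweighting: 1 in the quadratic region, \(\delta/|r|\) in the linear region.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((60, 3))
x_true = np.array([1.0, 2.0, -1.0])
b = A @ x_true + 0.05 * rng.standard_normal(60)
b[:5] += 10.0                       # gross outliers

def huber_irls(A, b, delta=1.0, iters=50):
    """Iteratively reweighted least squares for the Huber loss."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares warm start
    for _ in range(iters):
        r = A @ x - b
        # Huber weights: 1 in the quadratic region, delta/|r| beyond it
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        x = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * b))
    return x

x_hub = huber_irls(A, b)
print(np.round(x_hub, 2))
```

Each iteration is just a weighted least-squares solve, which is why IRLS inherits the fast local behavior of Newton-type methods here.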
Worst-case robust regression¶
Sometimes we don’t just want to “fit the data we saw,” but to “fit any data within some uncertainty set.” This leads to min–max problems of the form:
\[ \min_x \;\; \max_{\Delta \in \mathcal{U}} \;\; \big\| (A + \Delta)\,x - b \big\| \]
Meaning:
- \(\mathcal{U}\) is an uncertainty set describing how much you distrust the matrix \(A\), the inputs, or the measurements.
- You choose \(x\) that performs well even in the worst allowed perturbation.
Why this is still tractable:
- If \(\mathcal{U}\) is convex (for example, an \(\ell_2\) ball or box bounds on each entry), the inner maximization often has a closed-form expression.
- That inner max usually turns into an extra norm penalty or a conic constraint in the outer problem. Example: if the rows of \(A\) can move within an \(\ell_2\) ball of radius \(\epsilon\), the robustified problem often picks up an additional \(\ell_2\) term like \(\gamma \|x\|_2\) in the objective.
- The final problem is still convex (often a QP or SOCP).
How to solve it:
- If it reduces to an LP / QP / SOCP, use an interior-point (conic) solver to get a high-quality solution and a dual certificate.
- If the structure is separable and high-dimensional, you can sometimes solve the dual, or a proximal/ADMM splitting of the robust problem, with first-order methods.
Maximum likelihood and loss design¶
Choosing a loss often comes from a probabilistic noise model. The negative log-likelihood (NLL) of an assumed distribution gives a convex loss for many common cases:
- Gaussian (normal) noise
Model:
\[ b_i = a_i^\top x + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \;\; \text{i.i.d.} \]
The negative log-likelihood (NLL) is proportional to:
\[ \|Ax - b\|_2^2 \]
This recovers the classic least-squares loss (as in linear regression). It is smooth and convex (strongly convex if \(A^\top A\) is full rank). Algorithms:
- Closed-form via \((A^\top A + \lambda I)^{-1} A^\top b\) (for ridge regression),
- Iterative methods: conjugate gradient, gradient descent (Chapter 9),
- Or Newton / quasi-Newton methods (Chapter 9) using the constant Hessian \(A^\top A\).
- Laplace (double-exponential) noise
If \(\varepsilon_i \sim \text{Laplace}(0, \beta)\) i.i.d., the NLL is proportional to:
\[ \|Ax - b\|_1 \]
This is exactly ℓ₁ regression (least absolute deviations). It can be solved as an LP (interior-point), or with first-order nonsmooth methods (subgradient/proximal) for large-scale problems.
- Logistic model (binary classification)
For \(y_i \in \{0,1\}\), model:
\[ \Pr(y_i = 1 \mid a_i) \;=\; \frac{1}{1 + e^{-a_i^\top x}} \]
The negative log-likelihood (logistic loss) is:
\[ \ell(x) \;=\; \sum_{i} \Big[ \log\!\big(1 + e^{a_i^\top x}\big) - y_i\, a_i^\top x \Big] \]
This loss is convex and smooth in \(x\). No closed-form solution exists. Algorithms:
- With ℓ₂ regularization: smooth and (if \(\lambda>0\)) strongly convex → use accelerated gradient or quasi-Newton (e.g. L-BFGS).
- With ℓ₁ regularization (sparse logistic): composite convex → use proximal gradient (soft-thresholding) or coordinate descent.
- Softmax / multinomial logistic (multiclass)
For \(K\) classes with one-hot labels \(y_i \in \{e_1, \dots, e_K\}\), the softmax model gives the NLL:
\[ \ell(\{x_k\}) \;=\; \sum_i \Big[ \log \sum_{j=1}^{K} e^{x_j^\top a_i} \;-\; \sum_{k=1}^{K} y_{ik}\, x_k^\top a_i \Big] \]
This loss is convex in the weight vectors \(\{x_k\}\) and generalizes binary logistic to multiclass.
Algorithms:
- Gradient-based solvers (L-BFGS, Newton with block Hessian) for moderate size.
- Stochastic gradient (SGD, Adam) for large datasets.
- Generalized linear models (GLMs)
In GLMs, \(y_i\) given \(x\) has an exponential-family distribution (Poisson, binomial, etc.) with mean related to \(a_i^\top x\). The NLL is convex in \(x\) for canonical links (e.g. log-link for Poisson, logit for binomial). Examples:
- Poisson regression for counts: convex NLL, solved by IRLS or gradient.
- Probit models: convex but require iterative solvers.
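To make the logistic case concrete, a minimal gradient-descent sketch on the logistic NLL (synthetic data; the step size and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 2
A = rng.standard_normal((N, d))
w_true = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-A @ w_true))
y = (rng.uniform(size=N) < p).astype(float)   # labels in {0, 1}

def nll(w):
    z = A @ w
    # log(1 + e^z) - y*z, computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    # gradient of the logistic NLL: A^T (sigmoid(Aw) - y)
    return A.T @ (1.0 / (1.0 + np.exp(-A @ w)) - y)

w = np.zeros(d)
eta = 0.01
for _ in range(2000):
    w -= eta * grad(w)

print("estimate:", np.round(w, 1))
```

Since the loss is smooth and convex, plain gradient descent with a small enough step converges to the MLE; L-BFGS or Newton would simply get there faster.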
Structured constraints in engineering and design¶
Optimization problems often include explicit convex constraints from physical or resource limits: e.g. variable bounds, norm limits, budget constraints. The solver choice depends on how easily we can handle projections or barriers for \(\mathcal{X}\):
- Simple (projection-friendly) constraints. Examples:
  - Box constraints \(l \le x \le u\) → projection: clip each entry to \([l_i, u_i]\).
  - ℓ₂-ball \(\|x\|_2 \le R\) → projection: rescale \(x\) if \(\|x\|_2 > R\).
  - Simplex \(\{x \ge 0, \sum_i x_i = 1\}\) → projection: sort and threshold coordinates (a simple \(O(n \log n)\) algorithm).
-
General convex constraints (non-projection-friendly) If constraints are complex (e.g. second-order cones, semidefinite, or many coupled inequalities), projections are hard. Two strategies:
-
Barrier / penalty and interior-point methods : Add a log-barrier or penalty and solve with an interior-point solver (Chapter 9). This handles general convex constraints well and returns dual variables (Lagrange multipliers) as a bonus.
-
Conic formulation + solver: Write the problem as an LP/QP/SOCP/SDP and use specialized solvers (like MOSEK, Gurobi) that exploit sparse structure. If only first-order methods are feasible for huge problems, one can apply dual decomposition or ADMM by splitting constraints (Chapter 10), but convergence will be slower.
-
Algorithmic pointers:
- Projection-friendly constraints → Projected (stochastic) gradient or proximal methods (fast, maintain feasibility).
- Complex constraints (cones, PSD, many linear) → Use interior-point/conic solvers (Chapter 9) for moderate size. Alternatively, use operator-splitting (ADMM) if parallel/distributed solution is needed (Chapter 10).
- LP/QP special cases → Use simplex or specialized LP/QP solvers (Section 11.5).
Remarks: Encoding design requirements (actuator limits, stability margins, probability budgets) as convex constraints lets us leverage efficient convex solvers. Feasible set geometry dictates the method: easy projection → projective methods; otherwise → interior-point or operator-splitting.
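The projection-friendly sets listed above all have simple closed-form projections; a sketch in NumPy (the simplex routine is the standard sort-and-threshold algorithm):

```python
import numpy as np

def proj_box(x, l, u):
    """Clip each coordinate to [l_i, u_i]."""
    return np.clip(x, l, u)

def proj_l2_ball(x, R):
    """Rescale x if it lies outside the radius-R ball."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def proj_simplex(x):
    """Euclidean projection onto {x >= 0, sum x = 1} by sort-and-threshold."""
    u = np.sort(x)[::-1]                 # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(x + theta, 0.0)

print(proj_simplex(np.array([0.5, 1.2, -0.3])))
```

Each projection is cheap (at worst a sort), which is exactly what makes projected-gradient methods practical for these sets.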
Linear and conic programming: the canonical models¶
Many practical problems reduce to linear programming (LP) or its convex extensions.
LP and related conic forms are the workhorses of operations research, control, and engineering optimization.
-
Linear programs: standard form
\[ \min_x \;\; c^\top x \quad \text{s.t.} \quad Ax = b, \;\; x \ge 0 \]
Both objective and constraints are affine, so the optimum lies at a vertex of the feasible polyhedron. The simplex method traverses vertices and is often very fast in practice. Interior-point methods approach the optimum through the interior and have polynomial-time guarantees. For moderate LPs, interior-point is robust and accurate; for very large LPs (sparse, structured), first-order methods or decomposition may be needed.
- Quadratic, SOCP, SDP: convex quadratic programs (QP), second-order cone programs (SOCP), and semidefinite programs (SDP) generalize LP. For example, many robust or regularized problems (elastic net, robust regression, classification with norm constraints) can be cast as QPs or SOCPs. All of these are solved very efficiently by interior-point methods (Chapter 9), and interior-point solvers (MOSEK, Gurobi, etc.) are widely used off-the-shelf for these problem classes.
-
Practical patterns:
- Resource allocation/flow (LP): linear costs and constraints.
- Minimax/regret problems: e.g. \(\min_{x}\max_{i}|a_i^T x - b_i|\) → LP (as in Chebyshev regression).
- Constrained least squares: can be QP or SOCP if constraints are polyhedral or norm-based.
Algorithmic pointers for 11.5:
- Moderate LP/QP/SOCP: use interior-point (robust, yields dual prices) or simplex (fast in practice, warm-startable).
- Large-scale LP/QP: exploit sparsity; use decomposition (Benders, ADMM) if the structure allows; use iterative methods (primal-dual hybrid gradient, etc.) for extreme scale.
- Reformulate into standard form: recognize when your problem is an LP/QP/SOCP/SDP so you can use mature solvers (e.g. ℓ∞ regression → LP; ℓ₂ regression with an ℓ₂ constraint → SOCP).
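As one concrete reformulation pattern, the minimax (Chebyshev) regression problem \(\min_x \max_i |a_i^\top x - b_i|\) becomes an LP with a single extra variable \(t\); a sketch with `scipy.optimize.linprog` on assumed synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
m, n = 30, 2
A = rng.standard_normal((m, n))
b = A @ np.array([1.0, -1.0]) + 0.05 * rng.standard_normal(m)

# minimize t  s.t.  -t <= Ax - b <= t   (variables z = (x, t))
c = np.concatenate([np.zeros(n), [1.0]])
ones = np.ones((m, 1))
A_ub = np.block([[A, -ones], [-A, -ones]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)])
x_cheb, t = res.x[:n], res.x[n]
print("worst-case residual:", round(float(t), 3))
```

At the optimum, \(t\) equals the ℓ∞ norm of the residual vector, which is exactly the quantity being minimized.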
Risk, safety margins, and robust design¶
Modern design often includes risk measures or robustness. Two common patterns:
-
Chance constraints / risk-adjusted objectives. E.g. require that \(\Pr(\text{loss}(x,\xi) > \tau) \le \delta\). A convex surrogate is to include the mean and a multiple of the standard deviation:
\[ \mathbb{E}[\text{loss}(x,\xi)] \;+\; \kappa \sqrt{\mathrm{Var}[\text{loss}(x,\xi)]} \;\le\; \tau \]
Algebra often leads to second-order cone constraints (e.g. forcing \(\mathbb{E}\pm \kappa\sqrt{\mathrm{Var}}\) below a threshold). Such problems are SOCPs. Interior-point solvers handle them well (polynomial-time, high accuracy).
-
Worst-case (robust) optimization: Specify an uncertainty set \(\mathcal{U}\) for data (e.g. \(u\) in a norm-ball) and minimize the worst-case cost \(\max_{u\in\mathcal{U}}\ell(x,u)\). Many losses \(\ell\) and convex \(\mathcal{U}\) yield a convex max-term (a support function or norm). The result is often a conic constraint (for ℓ₂ norms, an SOCP; for PSD, an SDP). Solve with interior-point (if problem size permits) or with specialized proximal/ADMM methods (splitting the max-term).
Algorithmic pointers for 11.6:
- Risk/SOCP models: Interior-point (Chapter 9) is the standard approach.
- Robust max-min problems: Convert inner max to a convex constraint (norm or cone). Then use interior-point if the reformulation is conic. If the reformulation is a nonsmooth penalty, use proximal or dual subgradient methods.
- Distributed or iterative solutions: If \(\mathcal{U}\) or loss separable, ADMM can distribute the computation (Chapter 10).
Cheat sheet: If your problem looks like this, use that¶
This summary gives concrete patterns of models and recommended solvers/tricks:
-
(A) Smooth least-squares + ℓ₂:
- Model: \(\|Ax-b\|_2^2 + \lambda\|x\|_2^2\).
- Solve: Gradient descent, accelerated gradient, conjugate gradient, or Newton/quasi-Newton. (Strongly convex quadratic ⇒ fast second-order methods.)
-
(B) Sparse regression (ℓ₁):
- Model: \(\tfrac12\|Ax-b\|_2^2 + \lambda\|x\|_1\).
- Solve: Proximal gradient (soft-thresholding) or coordinate descent. (Composite smooth+nonsmooth separable.)
-
(C) Robust regression (outliers):
- Models: \(\sum|a_i^T x - b_i|\), Huber loss, etc.
- Solve: Interior-point (LP/SOCP form) for high accuracy; subgradient/proximal (Chapter 9/10) for large data. (Convex but nondifferentiable or conic.)
-
(D) Logistic / log-loss (classification):
- Model: \(\sum_i\big[-y_i(w^\top a_i)+\log(1+e^{w^\top a_i})\big] + \lambda R(w)\) with \(R(w)=\|w\|_2^2\) or \(\|w\|_1\).
- Solve:
- If \(R=\ell_2\): use Newton/gradient (smooth, strongly convex).
- If \(R=\ell_1\): use proximal gradient or coordinate descent. (Convex; logistic loss is smooth; ℓ₁ adds nonsmoothness.)
-
(E) Constraints (hard limits):
- Model: \(\min f(x)\) s.t. \(x\in\mathcal{X}\) with \(\mathcal{X}\) simple.
- Solve: Projected (stochastic) gradient or proximal methods if projection \(\Pi_{\mathcal{X}}\) is cheap (e.g. box, ball, simplex). If \(\mathcal{X}\) is complex (second-order or SDP), use interior-point.
-
(F) Separable structure:
- Model: \(\min_{x,z} f(x)+g(z)\) s.t. \(Ax+Bz=c\).
- Solve: ADMM (Chapter 10) – it decouples updates in \(x\) and \(z\); suits distributed or block-structured data.
-
(G) LP/QP/SOCP/SDP:
- Model: linear/quadratic objective with linear/conic constraints.
- Solve: Simplex or interior-point (for moderate sizes). For very large sparse LPs exploit problem structure: warm-starts, decomposition methods (dual/block), or first-order methods (PDHG/ADMM).
-
(H) Nonconvex patterns:
-
Examples: Deep neural networks (nonconvex weights), matrix factorization (bilinear), K-means clustering, mixture models.
-
Solve: There is no single global solver – typically use stochastic gradient (SGD/Adam), alternating minimization (e.g. alternating least squares for matrix factorization), or EM for mixtures. Caveat: Convergence to global optimum is not guaranteed; solutions depend on initialization and may get stuck in local minima. Use regularization, multiple restarts, and heuristics (batch normalization, momentum) as needed.
-
-
(I) Logistic (multi-class softmax):
-
Model: One weight vector per class, convex softmax loss (see Section 11.3).
-
Solve: Similar to binary case – Newton/gradient with L2, or proximal/coordinate with ℓ₁.
-
-
(J) Poisson and count models:
- Model: Negative log-likelihood for Poisson (convex, see Section 11.3).
- Solve: Newton (IRLS) or gradient-based; interior-point can be used after conic reformulation.
Rule of thumb: Identify whether your objective is smooth vs nonsmooth, strongly convex vs just convex, separable vs coupled, constrained vs unconstrained. Then pick from:
- Smooth & strongly convex → (quasi-)Newton or accelerated gradient.
- Smooth + ℓ₁ → Proximal gradient/coordinate.
- Nonsmooth separable → Proximal or coordinate.
- Easy projection constraint → Projected gradient.
- Hard constraints or conic structure → Interior-point.
- Large-scale separable → Stochastic gradient/ADMM.
Convexity guarantees that every local optimum is global. When the problem is nonconvex (deep nets, latent variables, etc.), we rely on heuristics: SGD, random restarts, and often settle for local minima or approximate solutions.
Matching Model Structure to Algorithm Type¶
| Model Type | Problem Form | Recommended Algorithms | Notes / ML Examples |
|---|---|---|---|
| Smooth unconstrained | \(\min f(x)\) | Gradient descent, Newton, LBFGS | Small to medium problems; logistic regression, ridge regression |
| Nonsmooth unconstrained | \(\min f(x) + R(x)\) | Subgradient, proximal (ISTA/FISTA), coordinate descent | LASSO, hinge loss SVM |
| Equality-constrained | \(\min f(x)\) s.t. \(A x = b\) | Projected gradient, augmented Lagrangian | Constrained least squares, balance conditions |
| Inequality-constrained | \(\min f(x)\) s.t. \(f_i(x)\le 0\) | Barrier, primal–dual, interior-point | Quadratic programming, LPs, constrained regression |
| Separable / block structure | \(\min \sum_i f_i(x_i)\) | ADMM, coordinate updates | Distributed optimization, federated learning |
| Stochastic / large data | \(\min_x \frac{1}{N}\sum_i f_i(x)\) | SGD, SVRG, adaptive variants | Deep learning, online convex optimization |
| Low-rank / matrix structure | \(\min f(X) + \lambda \|X\|_*\) | Proximal (SVD shrinkage), ADMM | Matrix completion, PCA variants |
Chapter 17: Canonical Problems in Convex Optimization¶
Convex optimization encompasses a wide range of problem classes. Despite their diversity, many real-world models reduce to a few canonical forms — each with characteristic geometry, structure, and algorithms.
Hierarchy of Canonical Problems¶
Convex programs form a nested hierarchy:
\[ \text{LP} \;\subset\; \text{QP} \;\subset\; \text{QCQP} \;\subset\; \text{SOCP} \;\subset\; \text{SDP} \]
Each inclusion represents an extension of representational power — from linear to quadratic, to conic, and finally to semidefinite constraints.
Separately, Geometric Programs (GPs) and Maximum Likelihood Estimators (MLEs) form additional convex families after suitable transformations.
| Class | Canonical Form | Key Condition | Typical Algorithms | ML / Applied Examples |
|---|---|---|---|---|
| LP | \(\min_x c^\top x\) s.t. \(A x=b,\,x\ge0\) | Linear constraints | Simplex, Interior-point | Resource allocation, Chebyshev regression |
| QP | \(\min_x \tfrac12 x^\top Q x + c^\top x\) s.t. \(A x\le b\) | \(Q\succeq0\) | Interior-point, Active-set, CG | Ridge, SVM, Portfolio optimization |
| QCQP | \(\min_x \tfrac12 x^\top P_0 x + q_0^\top x\) s.t. \(\tfrac12 x^\top P_i x + q_i^\top x \le0\) | All \(P_i\succeq0\) | Interior-point, SOCP reformulation | Robust regression, trust-region |
| SOCP | \(\min_x f^\top x\) s.t. \(\|A_i x + b_i\|_2 \le c_i^\top x + d_i\) | Cone constraints | Conic interior-point | Robust least-squares, risk limits |
| SDP | \(\min_X \mathrm{Tr}(C^\top X)\) s.t. \(\mathrm{Tr}(A_i^\top X)=b_i\), \(X\succeq0\) | Matrix PSD constraint | Interior-point, low-rank first-order | Covariance estimation, control |
| GP | \(\min_{x>0} f_0(x)\) s.t. \(f_i(x)\le1,\,g_j(x)=1\) | Log-convex after \(y=\log x\) | Log-transform + IPM | Circuit design, power control |
| MLE / GLM | \(\min_x -\sum_i \log p(b_i \mid a_i^\top x)+\mathcal{R}(x)\) | Log-concave likelihood | Newton, L-BFGS, Prox / SGD | Logistic regression |
Linear Programming (LP)¶
Form
\[ \min_x \;\; c^\top x \quad \text{s.t.} \quad Ax = b, \;\; x \ge 0 \]
Geometry: Feasible region = polyhedron; optimum = vertex.
Applications: Resource allocation, shortest path, flow, scheduling.
Algorithms:
- Simplex: walks along edges (vertex-based).
- Interior-point: moves through the interior using log barriers.
- Decomposition: exploits block structure for large LPs.
Quadratic Programming (QP)¶
Form
\[ \min_x \;\; \tfrac12 x^\top Q x + c^\top x \quad \text{s.t.} \quad Ax \le b, \qquad Q \succeq 0 \]
Geometry: Objective = ellipsoids; feasible = polyhedron.
Examples: Ridge regression, Markowitz portfolio, SVM.
Algorithms:
- Interior-point (smooth path).
- Active-set (edge-following).
- Conjugate Gradient for large unconstrained QPs.
- First-order methods for massive \(n\).
Quadratically Constrained QP (QCQP)¶
Form
\[ \min_x \;\; \tfrac12 x^\top P_0 x + q_0^\top x \quad \text{s.t.} \quad \tfrac12 x^\top P_i x + q_i^\top x \le 0, \;\; i=1,\dots,m \]
Convex if all \(P_i\succeq0\).
Geometry: Intersection of ellipsoids and half-spaces.
Applications: Robust control, filter design, trust-region.
Algorithms: Interior-point (convex case), SOCP / SDP reformulations.
Second-Order Cone Programming (SOCP)¶
Form
\[ \min_x \;\; f^\top x \quad \text{s.t.} \quad \|A_i x + b_i\|_2 \le c_i^\top x + d_i, \;\; i=1,\dots,m \]
Interpretation: Linear objective, norm-bounded constraints.
Applications: Robust regression, risk-aware portfolio, engineering design.
Algorithms: Conic interior-point; scalable ADMM variants.
Special case: Any QP or norm constraint can be written as an SOCP.
Semidefinite Programming (SDP)¶
Form
\[ \min_X \;\; \mathrm{Tr}(C^\top X) \quad \text{s.t.} \quad \mathrm{Tr}(A_i^\top X) = b_i, \;\; X \succeq 0 \]
Meaning: Variable = PSD matrix \(X\); constraints = affine.
Geometry: Feasible region = intersection of affine space with PSD cone.
Applications: Control synthesis, combinatorial relaxations, covariance estimation, matrix completion.
Algorithms: Interior-point for moderate \(n\); low-rank proximal / Frank–Wolfe for large-scale.
Geometric Programming (GP)¶
Original form
\[ \min_{x>0} \;\; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 1, \;\; g_j(x) = 1, \]
where \(f_i\) are posynomials and \(g_j\) monomials.
Log transformation: With \(y=\log x\), the problem becomes convex in \(y\).
Applications: Circuit sizing, communication power control, resource allocation.
Solvers: Convert to convex form → interior-point or primal-dual methods.
Likelihood-Based Convex Models (MLE and GLMs)¶
General form
\[ \min_x \;\; -\sum_i \log p\big(b_i \mid a_i^\top x\big) + \mathcal{R}(x) \]
Examples
| Noise Model | Objective | Equivalent Problem |
|---|---|---|
| Gaussian | \(\|A x - b\|_2^2\) | Least squares |
| Laplacian | \(\|A x - b\|_1\) | Robust regression (LP) |
| Bernoulli | \(\sum_i \log(1+e^{-y_i a_i^\top x})\) | Logistic regression |
| Poisson | \(\sum_i [a_i^\top x - y_i\log(a_i^\top x)]\) | Poisson GLM |
Algorithms
- Newton or IRLS (small–medium).
- Quasi-Newton / L-BFGS (moderate).
- Proximal or SGD (large-scale).
Solver Selection Summary¶
| Problem Type | Convex Form | Key Solvers | ML Examples |
|---|---|---|---|
| LP | Linear | Simplex, Interior-point | Minimax regression |
| QP | Quadratic | Interior-point, CG, Active-set | Ridge, SVM |
| QCQP | Quadratic + constraints | IPM, SOCP / SDP reformulation | Robust regression |
| SOCP | Cone | Conic IPM, ADMM | Robust least-squares |
| SDP | PSD cone | Interior-point, low-rank FW | Covariance, Max-cut relaxations |
| GP | Log-convex | Log-transform + IPM | Power allocation |
| MLE / GLM | Log-concave | Newton, L-BFGS, Prox-SGD | Logistic regression |
Chapter 18: Modern Optimizers in Machine Learning¶
The past decade has seen an explosion of nonconvex optimization problems, driven largely by deep learning. Training neural networks, large language models, and reinforcement learning agents all depend on stochastic optimization—balancing accuracy, generalization, and efficiency on massive, noisy datasets.
This chapter connects the principles of convex optimization to the modern optimizers that power today’s machine learning systems. While these algorithms often lack formal global guarantees, they are remarkably effective in practice.
Stochastic Optimization Overview¶
In machine learning, we often minimize an empirical risk:
\[ f(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \ell(x;\, z_i), \]
where \(\ell(x; z_i)\) is the loss on data sample \(z_i\).
Computing the full gradient \(\nabla f(x)\) is infeasible when \(N\) is large. Instead, stochastic methods estimate it using a mini-batch \(B_k\) of samples:
\[ \hat g_k \;=\; \frac{1}{|B_k|} \sum_{i \in B_k} \nabla \ell(x_k;\, z_i). \]
This yields the Stochastic Gradient Descent (SGD) update:
\[ x_{k+1} \;=\; x_k - \eta_k\, \hat g_k. \]
SGD is the foundation for nearly all deep learning optimizers.
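A minimal SGD loop on a least-squares objective (synthetic data; the batch size, step size, and iteration count are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
A = rng.standard_normal((N, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(N)

def minibatch_grad(x, idx):
    """Gradient of the average squared loss over a mini-batch."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

x = np.zeros(d)
eta = 0.05
for k in range(2000):
    idx = rng.integers(0, N, size=32)   # sample a mini-batch of 32
    x -= eta * minibatch_grad(x, idx)

print("distance to x_true:", round(float(np.linalg.norm(x - x_true)), 3))
```

Each step touches only 32 of the 1000 samples, yet the iterate hovers near the least-squares solution; the residual noise floor scales with the step size and batch size.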
Momentum and Acceleration¶
SGD’s noisy gradients can cause slow convergence and oscillations. Momentum smooths the update by accumulating a moving average of past gradients:
\[ v_{k+1} = \beta v_k + \hat g_k, \qquad x_{k+1} = x_k - \eta\, v_{k+1}, \]
where \(\beta \in [0,1)\) controls inertia.
Nesterov momentum adds a correction term, evaluating the gradient at the anticipated future position:
\[ v_{k+1} = \beta v_k + \nabla f\big(x_k - \eta \beta v_k\big), \qquad x_{k+1} = x_k - \eta\, v_{k+1}. \]
Momentum-based methods help traverse ravines and saddle regions efficiently.
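A small comparison of plain gradient descent, heavy-ball momentum, and Nesterov momentum on an assumed ill-conditioned quadratic "ravine" (deterministic gradients, for clarity):

```python
import numpy as np

# Ill-conditioned quadratic ravine: f(x) = 0.5 * (x1^2 + 50 * x2^2)
H = np.array([1.0, 50.0])
grad = lambda x: H * x

def run(beta, nesterov=False, eta=0.02, steps=100):
    """Gradient descent with optional (Nesterov) momentum; returns final ||x||."""
    x = np.array([5.0, 5.0])
    v = np.zeros(2)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the anticipated position
        g = grad(x - eta * beta * v) if nesterov else grad(x)
        v = beta * v + g
        x = x - eta * v
    return float(np.linalg.norm(x))

plain = run(beta=0.0)
heavy = run(beta=0.9)
nest = run(beta=0.9, nesterov=True)
print(plain, heavy, nest)
```

With the step size limited by the steep direction, plain gradient descent crawls along the shallow axis; both momentum variants reach the minimum orders of magnitude faster.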
Adaptive Learning Rate Methods¶
Different parameters often require different step sizes.
Adaptive methods adjust learning rates automatically using the history of squared gradients.
AdaGrad¶
Keeps a cumulative sum of squared gradients:
\[ G_k = G_{k-1} + \hat g_k \odot \hat g_k, \]
and updates parameters as:
\[ x_{k+1} = x_k - \frac{\eta}{\sqrt{G_k} + \epsilon} \odot \hat g_k. \]
Good for sparse data, but the learning rate can shrink too quickly.
RMSProp¶
A refinement of AdaGrad using exponential averaging:
\[ G_k = \rho\, G_{k-1} + (1-\rho)\, \hat g_k \odot \hat g_k, \qquad x_{k+1} = x_k - \frac{\eta}{\sqrt{G_k} + \epsilon} \odot \hat g_k. \]
RMSProp prevents the learning rate from vanishing and works well for nonstationary objectives.
Adam: Adaptive Moment Estimation¶
Adam combines momentum and adaptive scaling:
\[ m_k = \beta_1 m_{k-1} + (1-\beta_1)\, \hat g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\, \hat g_k \odot \hat g_k, \]
\[ \hat m_k = \frac{m_k}{1-\beta_1^k}, \qquad \hat v_k = \frac{v_k}{1-\beta_2^k}, \qquad x_{k+1} = x_k - \eta\, \frac{\hat m_k}{\sqrt{\hat v_k} + \epsilon}. \]
Adam adapts quickly to changing gradient scales, converging faster than vanilla SGD.
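A compact sketch of the Adam update, including the bias corrections, applied to an assumed badly scaled quadratic where per-coordinate adaptivity helps:

```python
import numpy as np

def adam(grad, x0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: momentum (m) plus per-coordinate scaling (v), both bias-corrected."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** k)             # bias corrections
        v_hat = v / (1 - beta2 ** k)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Badly scaled quadratic: gradients differ by 1000x across coordinates,
# but Adam's per-coordinate normalization makes progress in both.
grad = lambda x: np.array([1.0, 1000.0]) * x
x_star = adam(grad, [3.0, 3.0])
print(np.round(x_star, 3))
```

The bias correction matters early on: at step 1, \(\hat m_1 = g_1\) exactly, so the very first update already moves at the full step size regardless of \(\beta_1, \beta_2\).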
Variants and Modern Extensions¶
| Optimizer | Key Idea | Notes |
|---|---|---|
| AdamW | Decoupled weight decay from gradient update | Better regularization |
| RAdam | Rectified Adam—adaptive variance correction | Improves stability early in training |
| Lookahead | Combines fast and slow weights | Enhances robustness and convergence |
| AdaBelief | Uses prediction error instead of raw gradient variance | More adaptive learning rates |
| Lion | Uses sign-based updates and momentum | Efficient for large-scale training |
These variants represent the frontier of stochastic optimization in deep learning frameworks.
Implicit Regularization and Generalization¶
Modern optimizers not only minimize loss—they also affect generalization. SGD and its variants exhibit implicit bias toward flat minima, which often correspond to models with better generalization properties.
Empirical findings suggest:
- Large-batch training finds sharper minima (risk of overfitting).
- Noisy, small-batch SGD promotes flat, generalizable minima.
- Adaptive optimizers may converge faster but generalize slightly worse.
This trade-off drives ongoing research into optimizer design.
Practical Considerations¶
| Aspect | Guideline |
|---|---|
| Learning Rate | Most critical hyperparameter; use warm-up and decay schedules |
| Batch Size | Balances gradient noise and hardware efficiency |
| Initialization | Affects early dynamics, especially for Adam variants |
| Gradient Clipping | Prevents instability in exploding gradients |
| Mixed Precision | Use with adaptive optimizers for speed and memory savings |
Comparative Behavior¶
| Method | Adaptivity | Speed | Memory | Typical Use |
|---|---|---|---|---|
| SGD + Momentum | Moderate | Slow-medium | Low | General-purpose, good generalization |
| RMSProp | Adaptive per-parameter | Medium-fast | Medium | Recurrent networks, nonstationary data |
| Adam / AdamW | Fully adaptive | Fast | High | Deep networks, large-scale training |
| RAdam / AdaBelief / Lion | Advanced adaptivity | Fast | Medium | Cutting-edge training tasks |
Optimization in Modern Deep Networks¶
In deep learning, optimization interacts with architecture, loss, and regularization:
- Batch normalization modifies effective learning rates.
- Skip connections ease gradient flow.
- Large-scale distributed training relies on adaptive optimizers for stability.
Optimization is no longer an isolated procedure but part of the model’s design philosophy.
Modern stochastic optimizers extend classical first-order methods into high-dimensional, noisy, nonconvex regimes. They are the engines behind deep learning—adapting dynamically, balancing efficiency and generalization.
Chapter 19: Beyond Convexity – Nonconvex and Global Optimization¶
Optimization extends far beyond the comfortable world of convexity. In practice, most problems in machine learning, signal processing, control, and engineering design are nonconvex: their objective functions have multiple valleys, peaks, and saddle points. Convex optimization gives us strong guarantees; every local minimum is global, and algorithms converge predictably. But the moment convexity is lost, these guarantees vanish, and new techniques become necessary.
The Landscape of Nonconvex Optimization¶
A nonconvex function \(f:\mathbb{R}^n \to \mathbb{R}\) violates convexity; i.e., for some \(x, y\) and \(\theta \in (0,1)\),
\[ f\big(\theta x + (1-\theta)y\big) \;>\; \theta f(x) + (1-\theta) f(y). \]
Its level sets can fold, twist, and fragment, creating local minima, local maxima, and saddle points scattered throughout the space.
A typical nonconvex landscape looks like a mountainous terrain: smooth in some regions, rugged in others. An optimization algorithm’s path depends strongly on initialization and stochastic effects.
Example: A Simple Nonconvex Function¶
Consider, for example,
\[ f(x, y) \;=\; -\,x\,y\; e^{-(x^2 + y^2)/2}. \]
This function has multiple stationary points:
- \((0,0)\) (a saddle),
- \((1,1)\) and \((-1,-1)\) (local minima),
- \((1,-1)\) and \((-1,1)\) (local maxima).
Unlike convex problems, gradient descent may end in different minima depending on where it starts.
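A quick numerical illustration of this start-dependence, using as an assumed example surface \(f(x,y) = -xy\,e^{-(x^2+y^2)/2}\), which has minima at \((1,1)\) and \((-1,-1)\) and a saddle at the origin:

```python
import numpy as np

# Gradient of f(x, y) = -x*y*exp(-(x^2 + y^2)/2)
def grad(p):
    x, y = p
    e = np.exp(-(x * x + y * y) / 2.0)
    return np.array([y * (x * x - 1.0) * e,
                     x * (y * y - 1.0) * e])

def descend(p0, eta=0.5, steps=500):
    """Plain gradient descent with a fixed step size."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= eta * grad(p)
    return p

print(descend([0.5, 0.8]))     # converges into one basin
print(descend([-0.5, -0.8]))   # converges into the other
```

Both runs use the identical algorithm and step size; only the initialization differs, and each run settles into the minimum of its own basin of attraction.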
Local vs. Global Minima¶
A point \(x^*\) is a local minimum if, for some \(\varepsilon > 0\),
\[ f(x^*) \le f(x) \quad \text{for all } x \text{ with } \|x - x^*\| \le \varepsilon. \]
A global minimum satisfies the stronger condition:
\[ f(x^*) \le f(x) \quad \text{for all } x \in \mathbb{R}^n. \]
In convex problems, every local minimum is automatically global. In nonconvex problems, local minima can be arbitrarily bad and there may be exponentially many of them.
Classes of Nonconvex Problems¶
Nonconvex problems appear in several distinct forms:
| Type | Example | Challenge |
|---|---|---|
| Smooth nonconvex | Neural network training | Multiple minima, saddle points |
| Nonsmooth nonconvex | Sparse regularization, ReLU activations | Undefined gradients |
| Discrete / combinatorial | Scheduling, routing, integer programs | Exponential search space |
Each category requires different algorithmic strategies: from stochastic gradient methods to evolutionary heuristics or surrogate modeling.
Local Optimization Strategies¶
Even in nonconvex settings, local optimization remains useful when:
- The problem is nearly convex (e.g., locally convex around good minima),
- The initialization is close to a desired basin of attraction,
- Or the goal is approximate, not exact, optimality.
Gradient Descent and Its Variants¶
Gradient descent behaves well if \(f\) is smooth with a Lipschitz-continuous gradient:
\[ \|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\| \quad \text{for all } x, y. \]
However, convergence is guaranteed only to a stationary point, not necessarily a minimum. Escaping saddles: adding small random noise (stochasticity) helps escape the flat saddle regions common in high-dimensional problems.
Global Optimization Strategies¶
To seek the global minimum, algorithms must explore the search space more broadly.
Common strategies include:
-
Multiple Starts: Run local optimization from diverse random initial points and keep the best solution.
-
Continuation and Homotopy Methods: Start from a smooth, convex approximation \(f_\lambda\) of \(f\) and gradually transform it into the true objective as \(\lambda \to 0\).
-
Stochastic Search and Simulated Annealing: Introduce randomness in updates to jump between basins.
-
Population-Based Methods: Maintain a swarm or population of candidate solutions evolving by selection and variation — leading to metaheuristic algorithms like Genetic Algorithm and Particle Swarm Optimization.
Deterministic vs. Stochastic Global Methods¶
| Deterministic Methods | Stochastic Methods |
|---|---|
| Systematic exploration of space (branch & bound, interval analysis) | Randomized search (simulated annealing, evolutionary algorithms) |
| Can provide certificates of global optimality | Typically approximate but scalable |
| High computational cost | Naturally parallelizable |
In real-world large-scale problems, stochastic global optimization is often the only feasible approach.
A Taxonomy of Optimization Beyond Convexity¶
| Family | Typical Algorithms | When to Use |
|---|---|---|
| Derivative-Free (Black-Box) | Nelder–Mead, CMA-ES, Bayesian Opt. | When gradients unavailable |
| Metaheuristic (Evolutionary) | GA, PSO, DE, ACO | Complex landscapes, combinatorial problems |
| Modern Stochastic Gradient | Adam, RMSProp, Lion | Deep learning, large-scale models |
| Combinatorial / Discrete | Branch & Bound, Tabu, SA | Integer or graph-based problems |
| Learning-Based Optimizers | Meta-learning, Reinforcement methods | Adaptive, data-driven optimization |
Chapter 20: Derivative-Free and Black-Box Optimization¶
In many optimization problems, gradients are unavailable, unreliable, or prohibitively expensive to compute. Examples include tuning hyperparameters of machine learning models, engineering design through simulation, or optimizing physical experiments. Such problems fall under the class of derivative-free or black-box optimization methods.
Unlike gradient-based methods, which rely on analytical or automatic differentiation, derivative-free algorithms make progress solely from function evaluations.
Motivation and Challenges¶
Let \(f: \mathbb{R}^n \to \mathbb{R}\) be an objective function.
A derivative-free algorithm seeks to minimize \(f(x)\) using only evaluations of \(f(x)\), without access to \(\nabla f(x)\) or \(\nabla^2 f(x)\).
Key challenges:
- No gradient information → difficult to infer descent directions.
- Expensive evaluations → every call to \(f(x)\) might require a simulation or experiment.
- Noise and stochasticity → evaluations may be corrupted by measurement or sampling error.
- High dimensionality → sampling-based methods scale poorly with \(n\).
Derivative-free optimization is thus a trade-off between exploration and exploitation, guided by heuristics or surrogate models.
Classification of Derivative-Free Methods¶
| Category | Representative Algorithms | Main Idea |
|---|---|---|
| Direct Search | Nelder–Mead, Pattern Search, MADS | Explore the space via geometric moves or meshes |
| Model-Based | BOBYQA, Trust-Region DFO | Build local quadratic or surrogate models of \(f\) |
| Evolutionary / Population-Based | CMA-ES, Differential Evolution | Evolve a population using stochastic operators |
| Probabilistic / Bayesian | Bayesian Optimization | Use probabilistic surrogate models to guide exploration |
Direct Search Methods¶
Direct search algorithms evaluate the objective function at structured sets of points and use comparisons, not gradients, to decide where to move.
Nelder–Mead Simplex Method¶
Perhaps the most famous derivative-free algorithm, Nelder–Mead maintains a simplex: a polytope of \(n+1\) vertices in \(\mathbb{R}^n\).
At each iteration:
- Evaluate \(f\) at all simplex vertices.
- Reflect, expand, contract, or shrink the simplex depending on performance.
- Continue until the simplex collapses near a minimum.
Simple, intuitive, and effective for small-scale smooth problems, though it lacks formal convergence guarantees in general.
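The reflect/expand/contract/shrink logic can be sketched in a few dozen lines. The following is a simplified variant (it uses only inside contraction, and the test function, initial step, and tolerance are illustrative assumptions), not a production implementation:

```python
def nelder_mead(f, x0, step=0.5, tol=1e-8, max_iter=500):
    """Simplified Nelder-Mead sketch with standard coefficients:
    reflection 1, expansion 2, contraction 0.5, shrink 0.5."""
    n = len(x0)
    # initial simplex: x0 plus n vertices perturbed along each axis
    simplex = [list(x0)]
    for i in range(n):
        v = list(x0)
        v[i] += step
        simplex.append(v)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        # centroid of all vertices except the worst
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        refl = [centroid[i] + (centroid[i] - worst[i]) for i in range(n)]
        if f(refl) < f(best):
            exp_ = [centroid[i] + 2 * (centroid[i] - worst[i]) for i in range(n)]
            simplex[-1] = exp_ if f(exp_) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            cont = [centroid[i] + 0.5 * (worst[i] - centroid[i]) for i in range(n)]
            if f(cont) < f(worst):
                simplex[-1] = cont
            else:  # shrink all vertices toward the best
                simplex = [best] + [[best[i] + 0.5 * (v[i] - best[i])
                                     for i in range(n)] for v in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]

# Minimize a smooth quadratic; the optimum is (3, -1)
quad = lambda x: (x[0] - 3)**2 + (x[1] + 1)**2
xmin = nelder_mead(quad, [0.0, 0.0])
```

In practice one would call a library routine such as `scipy.optimize.minimize(method="Nelder-Mead")` rather than a hand-rolled version.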
Pattern Search Methods¶
These methods (also called coordinate search or compass search) probe the function along coordinate directions or pre-defined patterns.
Typical update rule:
\[ x_{k+1} = x_k + \Delta_k \, d_i, \]
where \(d_i\) is a direction from a finite set (e.g., the coordinate axes \(\pm e_1, \dots, \pm e_n\)) and \(\Delta_k\) is the current step size.
If a direction yields improvement, move there; otherwise, shrink \(\Delta_k\).
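This probe-and-shrink loop can be sketched as a compass search over the coordinate directions (the objective, initial step, and shrink factor of 0.5 are illustrative assumptions):

```python
def compass_search(f, x0, delta=1.0, tol=1e-6, max_iter=10000):
    """Compass (coordinate) search sketch: probe +/- each axis with step
    delta; move on improvement, otherwise halve delta."""
    x, n = list(x0), len(x0)
    fx = f(x)
    for _ in range(max_iter):
        if delta < tol:
            break
        improved = False
        for i in range(n):
            for sign in (+1.0, -1.0):
                cand = list(x)
                cand[i] += sign * delta
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
                    break
            if improved:
                break
        if not improved:
            delta *= 0.5  # no direction improved: refine the mesh
    return x

# Minimize a quadratic with optimum (2, -1)
xmin = compass_search(lambda x: (x[0] - 2)**2 + (x[1] + 1)**2, [0.0, 0.0])
```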
Mesh Adaptive Direct Search: MADS refines pattern search by maintaining a mesh of candidate points and adaptively changing its resolution. It offers provable convergence to stationary points for certain classes of nonsmooth problems.
Model-Based Methods¶
Instead of exploring blindly, model-based methods construct an approximation of the objective function from past evaluations.
Trust-Region DFO¶
A local model \(m_k(x)\) (often quadratic) is built to approximate \(f\) near the current iterate \(x_k\):
\[ m_k(x) = f(x_k) + g_k^\top (x - x_k) + \tfrac{1}{2}(x - x_k)^\top H_k (x - x_k), \]
with \(g_k\) and \(H_k\) estimated from past function evaluations. The next iterate solves a trust-region subproblem:
\[ \min_{x} \; m_k(x) \quad \text{subject to} \quad \|x - x_k\| \le \Delta_k. \]
The trust-region radius \(\Delta_k\) adapts based on how well \(m_k\) predicts true function values.
Bound Optimization BY Quadratic Approximation: BOBYQA builds and maintains a quadratic model using interpolation of previously evaluated points. It is highly efficient for medium-scale problems with simple box constraints and no noise.
Bayesian Optimization¶
Model the objective as a random function \(f(x) \sim \mathcal{GP}(m(x), k(x,x'))\) (Gaussian Process prior). After each evaluation, update the posterior mean and variance to quantify uncertainty.
Use an acquisition function \(a(x)\) to select the next evaluation point,
\[ x_{k+1} = \arg\max_{x} \; a(x), \]
balancing exploration (high posterior uncertainty) and exploitation (low posterior mean).
Common acquisition functions:
- Expected Improvement (EI)
- Probability of Improvement (PI)
- Upper Confidence Bound (UCB)
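The whole loop can be illustrated with a minimal one-dimensional sketch, assuming NumPy is available. The squared-exponential kernel, its length scale, the evaluation grid, and the lower-confidence-bound acquisition \(a(x) = \mu(x) - \kappa\,\sigma(x)\) (the minimization analogue of UCB) are all illustrative choices:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and pointwise std at Xs given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kss = rbf(Xs, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 0, None))

def bayes_opt(f, bounds=(0.0, 1.0), n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Minimal Bayesian optimization sketch: fit a GP, minimize the
    lower confidence bound mu - kappa*sigma over a fixed grid, repeat."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, n_init)
    y = np.array([f(x) for x in X])
    grid = np.linspace(*bounds, 200)
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, grid)
        x_next = grid[np.argmin(mu - kappa * sd)]  # LCB acquisition
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmin(y)]

# The objective has its minimum at x = 0.7 on [0, 1]
xbest = bayes_opt(lambda x: (x - 0.7)**2)
```

Libraries such as scikit-optimize or BoTorch provide production-grade versions with proper acquisition maximization and kernel hyperparameter fitting.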
Surrogate Models Beyond Gaussian Processes¶
When dimensionality is high or data is noisy, other surrogate models may replace GPs:
- Tree-structured Parzen Estimators (TPE)
- Random forests (SMAC)
- Neural network surrogates (Bayesian neural networks)
These variants enable Bayesian optimization in complex or discrete search spaces.
Hybrid and Adaptive Approaches¶
Modern applications often combine derivative-free and gradient-based techniques:
- Use Bayesian optimization for coarse global search, then local refinement with gradient descent.
- Alternate between CMA-ES and SGD to exploit both exploration and fast convergence.
- Apply direct search methods to tune hyperparameters of differentiable optimizers.
Such hybridization reflects a pragmatic view: no single optimizer is best — adaptability matters most.
Practical Considerations¶
| Aspect | Guideline |
|---|---|
| Function evaluations expensive | Use Bayesian or model-based methods |
| Noisy evaluations | Use averaging, smoothing, or robust estimators |
| High dimension (\(n > 50\)) | Prefer CMA-ES or evolutionary strategies |
| Box constraints | Methods like BOBYQA, DE, or PSO |
| Parallel computation available | Population-based methods excel |
Derivative-free optimization expands our toolkit beyond calculus, allowing us to optimize anything we can evaluate. It emphasizes adaptation, surrogate modeling, and population intelligence rather than analytical structure.
In the next chapter, we explore metaheuristic and evolutionary algorithms, which generalize these ideas further by mimicking natural and collective behaviors, turning randomness into a powerful search strategy.
Chapter 21: Metaheuristic and Evolutionary Algorithms¶
When optimization problems are highly nonconvex, discrete, or black-box, deterministic methods often fail to find good solutions. In these settings, metaheuristic algorithms—inspired by nature, biology, and collective behavior—provide robust and flexible alternatives. Metaheuristics are general-purpose stochastic search methods that rely on repeated sampling, adaptation, and survival-of-the-fittest ideas. They are especially effective when the landscape is rugged, multimodal, or not well understood.
Principles of Metaheuristic Optimization¶
All metaheuristics share three key principles:
- Population-Based Search: maintain multiple candidate solutions simultaneously to explore diverse regions of the search space.
- Variation Operators: create new solutions via mutation, recombination, or stochastic perturbations.
- Selection and Adaptation: favor candidates with better objective values, guiding the search toward promising regions.
Unlike local methods, metaheuristics balance exploration (global search) and exploitation (local refinement).
Genetic Algorithms¶
Genetic Algorithms mimic natural evolution, where populations evolve toward higher fitness through selection, crossover, and mutation.
Representation¶
A solution (individual) is represented as a chromosome, often a binary string, vector of reals, or permutation. Each position (gene) encodes part of the decision variable.
Algorithm Outline¶
- Initialize a population \(\{x_i\}_{i=1}^N\) randomly.
- Evaluate fitness \(f(x_i)\) for all individuals.
- Select parents based on fitness (e.g., tournament or roulette-wheel selection).
- Apply variation operators:
  - Crossover: combine genetic material of two parents.
  - Mutation: randomly alter some genes to maintain diversity.
- Form a new population and repeat until convergence.
Crossover and Mutation Examples¶
- Single-point crossover: exchange genes after a random index.
- Gaussian mutation: add small noise to continuous parameters.
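A toy real-coded GA combining these operators might look as follows (the population size, rates, mutation scale, elitism step, and the \([-5,5]\) search box are all illustrative assumptions):

```python
import random

def genetic_algorithm(f, n_genes, pop_size=40, n_gen=100,
                      cx_rate=0.9, mut_rate=0.1, seed=0):
    """Toy real-coded GA sketch: tournament selection, single-point
    crossover, Gaussian mutation, one-elite preservation.
    Minimizes f over the box [-5, 5]^n_genes."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(n_gen):
        def tournament():
            a, b = rng.sample(pop, 2)
            return a if f(a) < f(b) else b
        new_pop = [min(pop, key=f)]                # elitism: keep the best
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < cx_rate and n_genes > 1:
                cut = rng.randrange(1, n_genes)    # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            child = [g + rng.gauss(0, 0.3) if rng.random() < mut_rate else g
                     for g in child]               # Gaussian mutation
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=f)

# Minimize the 3-D sphere function; the optimum is the origin
best = genetic_algorithm(lambda x: sum(g * g for g in x), n_genes=3)
```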
| Strengths | Weaknesses |
|---|---|
| Highly parallel, robust, domain-independent | Requires many function evaluations |
| Effective for combinatorial and discrete optimization | Parameter tuning (mutation, crossover rates) is nontrivial |
Differential Evolution¶
Differential Evolution is a simple yet powerful algorithm for continuous optimization. Mutation is performed using differences of population members:
\[ v_i = x_{r_1} + F \,(x_{r_2} - x_{r_3}), \]
where \(r_1, r_2, r_3\) are random distinct indices and \(F \in [0,2]\) controls the mutation amplitude.
Binomial crossover then forms trial vectors,
\[ u_{i,j} = \begin{cases} v_{i,j} & \text{if } \mathrm{rand}_j \le CR \text{ or } j = j_{\mathrm{rand}}, \\ x_{i,j} & \text{otherwise}, \end{cases} \]
and selection keeps whichever of \(x_i\) and \(u_i\) has the better objective value.
Features¶
- Self-adaptive exploration of the search space.
- Suitable for continuous, multimodal functions.
- Simple to implement, with few control parameters.
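The classic DE/rand/1/bin scheme can be sketched directly from the mutation, crossover, and selection rules above (the \([-5,5]\) domain, population size, and control parameters \(F\), \(CR\) are illustrative choices):

```python
import random

def differential_evolution(f, n_dim, pop_size=30, F=0.8, CR=0.9,
                           n_gen=200, seed=0):
    """DE/rand/1/bin sketch: difference-based mutation, binomial
    crossover, greedy one-to-one selection over [-5, 5]^n_dim."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_dim)]
           for _ in range(pop_size)]
    for _ in range(n_gen):
        for i in range(pop_size):
            r1, r2, r3 = rng.sample(
                [j for j in range(pop_size) if j != i], 3)
            v = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d])
                 for d in range(n_dim)]            # mutation
            j_rand = rng.randrange(n_dim)          # guarantee one mutant gene
            u = [v[d] if (rng.random() < CR or d == j_rand) else pop[i][d]
                 for d in range(n_dim)]            # binomial crossover
            if f(u) <= f(pop[i]):                  # greedy selection
                pop[i] = u
    return min(pop, key=f)

# Minimize the 5-D sphere function; DE converges toward the origin
best = differential_evolution(lambda x: sum(g * g for g in x), n_dim=5)
```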
Particle Swarm Optimization¶
Inspired by the social behavior of birds and fish, Particle Swarm Optimization maintains a swarm of particles moving through the search space. Each particle \(i\) has position \(x_i\) and velocity \(v_i\), updated as
\[ v_i \leftarrow w\,v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i, \]
where:
- \(p_i\) = personal best position of particle \(i\),
- \(g\) = best global position found by the swarm,
- \(w\), \(c_1\), \(c_2\) are weight and learning coefficients,
- \(r_1\), \(r_2\) are random numbers in \([0,1]\).
Particles balance individual learning (self-experience) and social learning (group knowledge).
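The update rule translates into a compact loop (the inertia weight, learning coefficients, swarm size, and \([-5,5]\) domain are illustrative assumptions):

```python
import random

def pso(f, n_dim, n_particles=30, n_iter=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """PSO sketch with inertia weight w and cognitive/social
    coefficients c1, c2, over the box [-5, 5]^n_dim."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(n_dim)]
         for _ in range(n_particles)]
    V = [[0.0] * n_dim for _ in range(n_particles)]
    P = [list(x) for x in X]                  # personal best positions
    g = min(P, key=f)                         # global best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_dim):
                r1, r2 = rng.random(), rng.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if f(X[i]) < f(P[i]):
                P[i] = list(X[i])
                if f(P[i]) < f(g):
                    g = list(P[i])
    return g

# Minimize the 3-D sphere function; the swarm converges near the origin
best = pso(lambda x: sum(v * v for v in x), n_dim=3)
```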
Simulated Annealing¶
Simulated Annealing is one of the earliest and most fundamental stochastic optimization algorithms. It is inspired by annealing in metallurgy, a physical process in which a material is heated and then slowly cooled to minimize structural defects and reach a low-energy crystalline state. The key idea is to imitate this gradual “cooling” in the search for a global minimum.
Physical Analogy¶
In thermodynamics, a system at temperature \(T\) occupies a state with energy \(E\) with probability given by the Boltzmann distribution:
\[ P(E) \propto \exp\!\left( -\frac{E}{k_B T} \right), \]
where \(k_B\) is the Boltzmann constant.
At high temperature, the system freely explores many states. As \(T\) decreases, it becomes increasingly likely to remain near states of minimal energy.
Simulated Annealing maps this principle to optimization by treating:
- The objective function \(f(x)\) as the system’s energy.
- The solution vector \(x\) as a configuration.
- The temperature \(T\) as a control parameter determining randomness.
Algorithm Outline¶
- Initialization: choose an initial solution \(x_0\) and an initial temperature \(T_0\), and set a cooling schedule \(T_{k+1} = \alpha T_k\) with \(\alpha \in (0,1)\).
- Iteration: generate a candidate \(x'\) from \(x_k\) via a small random perturbation, compute \(\Delta f = f(x') - f(x_k)\), and accept or reject based on the Metropolis criterion:
\[ P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta f \le 0, \\ \exp(-\Delta f / T_k) & \text{otherwise.} \end{cases} \]
- Cooling: reduce the temperature gradually according to the schedule, and repeat until \(T\) becomes sufficiently small or the system stabilizes.
At high temperatures, SA accepts both better and worse moves → exploration. At low temperatures, it becomes increasingly selective → exploitation.
This balance allows SA to escape local minima and approach the global optimum over time.
The temperature schedule determines convergence quality:
| Type | Formula | Behavior |
|---|---|---|
| Exponential | \(T_{k+1} = \alpha T_k\) | Simple, widely used |
| Linear | \(T_{k+1} = T_0 - \beta k\) | Faster cooling, less exploration |
| Logarithmic | \(T_k = \frac{T_0}{\log(k + c)}\) | Theoretically convergent (slow) |
| Adaptive | Adjust based on recent acceptance rates | Practical and self-tuning |
A slower cooling schedule improves accuracy but increases computational cost.
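As a concrete sketch, the following minimal SA routine uses Gaussian perturbations, Metropolis acceptance, and the exponential schedule from the table (the test objective, step size, \(T_0\), and \(\alpha\) are illustrative assumptions):

```python
import math
import random

def simulated_annealing(f, x0, T0=10.0, alpha=0.95, step=0.5,
                        n_iter=2000, seed=0):
    """SA sketch: Gaussian neighbor moves, Metropolis acceptance,
    exponential cooling T_{k+1} = alpha * T_k."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    best, fbest = list(x), fx
    T = T0
    for _ in range(n_iter):
        cand = [v + rng.gauss(0, step) for v in x]    # random neighbor
        fc = f(cand)
        df = fc - fx
        if df <= 0 or rng.random() < math.exp(-df / T):
            x, fx = cand, fc                          # Metropolis accept
            if fx < fbest:
                best, fbest = list(x), fx
        T = max(alpha * T, 1e-12)                     # exponential cooling
    return best

# A 1-D multimodal objective with two basins near x = -2 and x = 2
f = lambda x: (x[0]**2 - 4)**2 + math.sin(5 * x[0])
best = simulated_annealing(f, [-3.0])
```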
Exploration vs. Exploitation¶
Every metaheuristic must balance:
- Exploration: sampling diverse regions to escape local minima.
- Exploitation: refining known good solutions to reach local optima.
| High Exploration | High Exploitation |
|---|---|
| GA with strong mutation | PSO with low inertia |
| DE with high \(F\) | ACO with low evaporation rate |
| Random restarts | Local refinement |
Adaptive control of parameters (e.g., mutation rate, inertia weight) helps maintain balance dynamically.
Performance and Practical Tips¶
| Aspect | Guideline |
|---|---|
| Initialization | Use wide, random distributions to promote diversity |
| Parameter Tuning | Use adaptive schedules (e.g., cooling, inertia decay) |
| Population Size | Larger for global search, smaller for fine-tuning |
| Parallelism | Evaluate populations concurrently for efficiency |
| Stopping Criteria | Use both iteration limits and stagnation detection |
Metaheuristics are heuristic by design — they do not guarantee global optimality, but offer practical success across many fields. Metaheuristic and evolutionary algorithms transform optimization into a process of adaptation and learning. Through populations, randomness, and natural analogies, they enable search in landscapes too complex for calculus or convexity.
Chapter 22: Advanced Topics in Combinatorial Optimization¶
In many of the most challenging optimization problems, variables are discrete, decisions are binary or integral, and the underlying structure is inherently combinatorial. Convex analysis gives way to graph theory, integer programming, and search algorithms built on discrete mathematics. Combinatorial optimization lies at the intersection of mathematics, computer science, and operations research, offering powerful tools for scheduling, routing, allocation, and design problems.
Nature of Combinatorial Problems¶
A combinatorial optimization problem can be expressed as
\[ \min_{x \in \mathcal{F}} \; f(x), \]
where \(\mathcal{F}\) is a finite or countable set of feasible solutions, often exponentially large in size.
Example forms include:
- Binary decisions: \(x_i \in \{0,1\}\)
- Integer constraints: \(x_i \in \mathbb{Z}\)
- Permutations: ordering or ranking elements
Unlike convex problems, feasible regions are discrete, and local moves must be designed carefully to explore the combinatorial space.
Graph-Theoretic Foundations¶
Many combinatorial problems are naturally represented as graphs \(G = (V, E)\).
Shortest Path Problem¶
Given edge weights \(w_{ij}\), find a path \(P\) from \(s\) to \(t\) minimizing the total weight
\[ \min_{P \in \mathcal{P}(s,t)} \; \sum_{(i,j) \in P} w_{ij}. \]
It is efficiently solvable by Dijkstra’s algorithm (nonnegative weights) or the Bellman–Ford algorithm.
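A compact sketch of Dijkstra’s algorithm with a binary heap (the adjacency-list representation and the toy graph are illustrative):

```python
import heapq

def dijkstra(adj, s):
    """Dijkstra's algorithm sketch: adj maps node -> list of
    (neighbor, weight) pairs with nonnegative weights.
    Returns shortest distances from source s."""
    dist = {s: 0}
    pq = [(0, s)]                       # (distance, node) min-heap
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue                    # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy graph: shortest s->t distance is 1 + 2 + 1 = 4, beating edge 7
adj = {'s': [('a', 1), ('t', 7)], 'a': [('b', 2)], 'b': [('t', 1)]}
dist = dijkstra(adj, 's')
```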
Minimum Spanning Tree (MST)¶
Find a subset of edges connecting all vertices with minimal total weight. Solved by Kruskal’s or Prim’s algorithm in \(O(E\log V)\) time.
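Kruskal’s greedy strategy can be sketched with a union-find structure (the 4-node example graph is an illustrative toy):

```python
def kruskal_mst(n, edges):
    """Kruskal's MST sketch: scan edges in increasing weight order and
    keep those joining distinct components (tracked via union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return total, mst

# Edges as (weight, u, v); the MST has weight 1 + 2 + 3 = 6
total, mst = kruskal_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3),
                             (4, 0, 2), (5, 0, 3)])
```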
Maximum Flow / Minimum Cut¶
Determine how much “flow” can be sent through a network subject to capacity limits. Duality connects max-flow and min-cut, linking graph algorithms to convex duality principles.
Integer Linear Programming (ILP)¶
An integer program seeks
\[ \min_{x} \; c^\top x \quad \text{subject to} \quad A x \le b, \; x \in \mathbb{Z}^n. \]
It generalizes many classical problems:
- Knapsack
- Assignment
- Scheduling
- Facility location
Relaxing \(x \in \mathbb{Z}^n\) to \(x \in \mathbb{R}^n\) yields a linear program (LP) that can be solved efficiently and provides a lower bound.
Relaxation and Rounding¶
A central idea is to solve a relaxed convex problem, then round its solution to a discrete one.
LP Relaxation¶
For binary variables \(x_i \in \{0,1\}\), relax to \(0 \le x_i \le 1\) and solve via simplex or interior-point methods.
Semidefinite Relaxation¶
For quadratic binary problems, lift to a matrix variable \(X = xx^\top\) and relax the rank constraint, keeping only positive semidefiniteness:
\[ \min_{X} \; \langle C, X \rangle \quad \text{subject to} \quad X_{ii} = 1, \; X \succeq 0. \]
Semidefinite relaxations are powerful in problems like MAX-CUT and clustering.
Randomized Rounding¶
Map fractional solutions back to integers probabilistically, preserving expected properties.
Branch-and-Bound and Search Trees¶
Exact combinatorial optimization often relies on enumeration enhanced by bounding.
Basic Principle¶
- Partition the feasible set into subsets (branching).
- Compute upper/lower bounds for each subset.
- Prune branches that cannot contain the optimum.
The algorithm systematically explores a search tree, guided by bounds.
Bounding via Relaxations¶
LP or convex relaxations provide efficient lower bounds, greatly reducing the search space.
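These ideas can be illustrated on a 0/1 knapsack instance, where the fractional (LP) relaxation of the remaining subproblem supplies the bound used for pruning (the instance itself is an illustrative toy):

```python
def knapsack_bb(values, weights, capacity):
    """0/1 knapsack via branch-and-bound sketch. Items are explored in
    value-density order; the node bound is the fractional relaxation."""
    items = sorted(zip(values, weights),
                   key=lambda t: t[0] / t[1], reverse=True)
    n, best = len(items), 0

    def bound(k, cap, val):
        # optimistic bound: greedily add items, last one fractionally
        for v, w in items[k:]:
            if w <= cap:
                cap -= w
                val += v
            else:
                return val + v * cap / w
        return val

    def dfs(k, cap, val):
        nonlocal best
        best = max(best, val)
        if k == n or bound(k, cap, val) <= best:
            return                          # prune: cannot beat incumbent
        v, w = items[k]
        if w <= cap:
            dfs(k + 1, cap - w, val + v)    # branch: include item k
        dfs(k + 1, cap, val)                # branch: exclude item k

    dfs(0, capacity, 0)
    return best

# Capacity 10: optimal value 15 (take the items of weight 4 and 6)
opt = knapsack_bb([8, 7, 6], [4, 6, 5], 10)
```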
Dynamic Programming¶
Dynamic programming (DP) decomposes a problem into overlapping subproblems via a Bellman-style recursion:
\[ V(s) = \min_{a \in A(s)} \big\{ c(s, a) + V(T(s, a)) \big\}, \]
where \(s\) is a state, \(a\) an admissible action, \(c(s,a)\) the stage cost, and \(T(s,a)\) the successor state. It is exact but can suffer from exponential growth of the state space (“curse of dimensionality”).
Applications:
- Shortest paths
- Sequence alignment
- Knapsack
- Resource allocation
DP offers exact solutions when structure allows sequential decomposition.
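Sequence alignment admits exactly this kind of decomposition; a minimal Levenshtein-distance sketch, where \(V[i][j]\) is the optimal cost for the prefixes \(a[:i]\) and \(b[:j]\):

```python
def edit_distance(a, b):
    """Sequence-alignment DP sketch (Levenshtein distance): V[i][j] is
    the minimum edit cost between prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        V[i][0] = i                     # delete all of a[:i]
    for j in range(n + 1):
        V[0][j] = j                     # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            V[i][j] = min(V[i - 1][j] + 1,        # deletion
                          V[i][j - 1] + 1,        # insertion
                          V[i - 1][j - 1] + sub)  # match / substitution
    return V[m][n]

# Classic example: "kitten" -> "sitting" takes 3 edits
d = edit_distance("kitten", "sitting")
```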
Heuristics and Metaheuristics for Combinatorial Problems¶
When exact methods become intractable, we turn to approximation and stochastic search.
Greedy Heuristics¶
Make locally optimal choices at each step (e.g., nearest neighbor in TSP, Kruskal’s MST). Fast but not always globally optimal.
Local Search and Hill Climbing¶
Iteratively improve a current solution by small perturbations (e.g., swap two items, reassign a job). Can be trapped in local minima.
Metaheuristic Extensions¶
- Simulated Annealing: controlled random acceptance of worse moves.
- Tabu Search: memory-based diversification.
- Ant Colony Optimization: probabilistic path construction.
- Genetic Algorithms and PSO: population-based evolution.
These approaches generalize to discrete structures with minimal problem-specific design.
Approximation Algorithms¶
Some combinatorial problems are provably intractable to solve exactly but admit approximation guarantees of the form
\[ f(x_{\text{alg}}) \le \alpha \cdot f(x^\star), \]
where \(\alpha \ge 1\) is the approximation ratio (stated here for minimization).
Examples:
- Greedy Set Cover: \(\alpha = \ln n + 1\)
- Christofides’ Algorithm for TSP: \(\alpha = 1.5\)
- MAX-CUT SDP Relaxation: \(\alpha \approx 0.878\)
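The greedy set-cover guarantee above is achieved by a very short algorithm: repeatedly pick the set covering the most still-uncovered elements (the universe and sets below are an illustrative toy instance):

```python
def greedy_set_cover(universe, sets):
    """Greedy set-cover sketch: repeatedly choose the set that covers
    the largest number of still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            break                        # remaining elements uncoverable
        chosen.append(best)              # record the index of the set
        uncovered -= sets[best]
    return chosen

# Greedy picks set 0 ({1,2,3}) first, then set 3 ({4,5})
cover = greedy_set_cover({1, 2, 3, 4, 5},
                         [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}])
```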
Approximation theory blends combinatorics with convex relaxation insights.
Combinatorial optimization embodies the art of solving discrete, structured problems where convexity no longer applies. It draws from graph theory, algebra, logic, and probabilistic reasoning. Relaxation and approximation techniques build a bridge between the continuous and the discrete, uniting convex and combinatorial worlds.